Abstract

We discuss some methods to quantitatively investigate the properties of correlation 
matrices. Correlation matrices play an important role in portfolio optimization and 
in several other quantitative descriptions of asset price dynamics in financial mar- 
kets. Specifically, we discuss how to define and obtain hierarchical trees, correlation 
based trees and networks from a correlation matrix. The hierarchical clustering and 
other procedures performed on the correlation matrix to detect statistically reliable 
aspects of the correlation matrix are seen as filtering procedures of the correlation 
matrix. We also discuss a method to associate a hierarchically nested factor model 
to a hierarchical tree obtained from a correlation matrix. The information retained 
in filtering procedures and its stability with respect to statistical fluctuations is 
quantified by using the Kullback-Leibler distance. 

Key words: multivariate analysis, hierarchical clustering, correlation based 
networks, bootstrap validation, factor models, Kullback-Leibler distance. 
1 Introduction



Many complex systems observed in the physical, biological and social sciences 
are organized in a nested hierarchical structure, i.e. the elements of the system 
can be partitioned in clusters which in turn can be partitioned in subclusters 
and so on up to a certain level (Simon, 1962). The hierarchical structure of 
interactions among elements strongly affects the dynamics of complex sys- 
tems. Therefore a quantitative description of hierarchies of the system is a 
key step in the modeling of complex systems (Anderson, 1972). The analy- 
sis of multivariate data provides crucial information in the investigation of 
a wide variety of systems. Multivariate analysis methods are designed to
extract the information both on the number of main factors characterizing the
dynamics of the investigated system and on the composition of the groups 
(clusters) in which the system is intrinsically organized. Recently physicists 
started to contribute to the development of new techniques to investigate mul- 
tivariate data (Blatt et al., 1996; Hutt et al., 1999; Mantegna, 1999; Giada 
and Marsili, 2001; Kraskov et al., 2005; Tumminello et al., 2005; Tsafrir et al., 
2005; Slonim, 2005). Among multivariate techniques, natural candidates for 
detecting the hierarchical structure of a set of data are hierarchical clustering 
methods (Anderberg, 1973). 

The modeling of the correlation matrix of a complex system with tools of 
hierarchical clustering has been useful in the multivariate characterization of 
stock return time series (Mantegna, 1999; Bonanno et al., 2001; Bonanno et 
al., 2003), market index returns of worldwide stock exchanges (Bonanno et 
al., 2000), and volatility increments of stock return time series (Micciche et 
al., 2003), where the estimation of statistically reliable properties of the
correlation matrix is crucial for several financial decision processes such as
asset allocation, portfolio optimization (Tola et al., 2008) and derivative
pricing. In Tumminello et al. (2007a) we termed the selection of statistically
reliable information from the correlation matrix a "filtering procedure".
Hierarchical clustering procedures are filtering procedures. Other filtering
procedures that are popular within the econophysics community are procedures
based on random matrix theory (Laloux et al., 1999; Plerou et
al., 1999; Rosenow et al., 2002; Coronnello et al., 2005; Potters et al., 2005; 
Tumminello et al., 2007a), and procedures using the concept of shrinkage of a 
correlation matrix (Ledoit and Wolf, 2003; Schafer and Strimmer, 2005; Tum- 
minello et al., 2007b). Many others might be devised and their effectiveness 
tested. 

The correlation matrix of the time series of a multivariate complex system 
can be used to extract information about aspects of hierarchical organization 
of such a system. The clustering procedure is done by using the correlation 
between pairs of elements as a similarity measure and by applying a clustering 
algorithm to the correlation matrix. As a result of the clustering procedure, 
a hierarchical tree of the elements of the system is obtained. The correlation 
based clustering procedure also allows one to associate a correlation based network
with the correlation matrix. For example, it is natural to select the minimum 
spanning tree, i.e. the shortest tree connecting all the elements in a graph, as 
the correlation based network associated with the single linkage cluster anal- 
ysis. Different correlation based networks can be associated with the same 
hierarchical tree putting emphasis on different aspects of the sample correla- 
tion matrix. Useful examples of correlation based networks different from the 
minimum spanning tree are the planar maximally filtered graph (Tumminello 
et al., 2005) and the average linkage minimum spanning tree (Tumminello et 
al., 2007c).



In correlation based hierarchical investigations the statistical reliability of
hierarchical trees and networks depends on the statistical reliability of the
sample correlation matrix. The sample correlation matrix is computed by using
a finite number T of records sampling the behavior of the elements
of the system. Due to the unavoidable finiteness of T, the estimation of the
sample correlation matrix presents a degree of statistical uncertainty that can
be characterized under widespread statistical assumptions. Physicists (Laloux
et al., 1999; Plerou et al., 1999) have contributed to the quantitative
estimation of the statistical uncertainty of the correlation matrix by using tools
and concepts of random matrix theory. However, theoretical results providing 
the statistical reliability of hierarchical trees and correlation based networks 
are still not available and therefore, a bootstrap approach has been used to 
quantify the statistical reliability of both hierarchical trees (Tumminello et al., 
2007d) and correlation based networks (Tumminello et al., 2007c). 

The hierarchical tree characterizing a complex system can also be used to 
extract a factor model with independent factors acting on different elements 
in a nested way. In other words, the number of factors controlling each element 
may be different and different factors may act at different hierarchical levels. 
Tumminello et al. (2007d) have shown how to associate a hierarchically nested 
factor model to a system described by a given hierarchical structure. 

With a large number of filtering procedures available, one needs a quantitative
methodology to estimate the information retained in a filtered correlation
matrix obtained from the sample correlation matrix. It is also important to
quantify the stability of the filtering procedure across different realizations
or replicas of the process, and the distance of the filtered correlation matrix
from a given reference model. For all these purposes, a very useful measure is
the Kullback-Leibler distance introduced in Tumminello et al. (2007a). This
distance has the important property that its value, when it quantifies the
distance between a sample correlation matrix and the correlation matrix of the
generating model, is independent of the specific correlation matrix of the
model, both for multivariate Gaussian variables (Tumminello et al., 2007a) and
for multivariate Student's t variables (Biroli et al., 2007;
Tumminello et al., 2007b).

In the present paper we discuss in a coherent and self-consistent way (i) some 
filtering procedures of the correlation matrix based on hierarchical clustering 
and the bootstrap validation of hierarchical trees and correlation based net- 
works, (ii) the hierarchically nested factor model, (iii) the Kullback-Leibler 
distance between the probability density functions of two sets of multivariate 
random variables and (iv) the retained information and stability of a filtered 
correlation matrix. We apply the discussed concepts to a portfolio of stocks
traded in a financial market. The paper is organized as follows. In Section 2 we
discuss how to obtain hierarchical trees and correlation based trees or networks
from the correlation matrix of a complex system and we discuss the role
of bootstrap in the statistical validation of hierarchical trees and correlation
based networks. In Section 3 we discuss the definition and the properties of a
factor model with independent factors which are hierarchically nested. In
Section 4 we present an empirical application of the hierarchically nested factor
model. Section 5 discusses how to quantify the information and stability of a
correlation matrix by using a Kullback-Leibler distance and Section 6 presents
the quantitative comparison of different filtering procedures performed with
the same distance. Section 7 briefly presents some conclusions.



2 Correlation based hierarchical organization and networks 



Hereafter we discuss a simple example illustrating two filtering procedures 
of a correlation matrix performed with methods of hierarchical clustering. 
In our approach, by using the correlation between elements as the similarity 
measure and by applying a given hierarchical clustering procedure, we first 
obtain a hierarchical tree. The information present in the hierarchical tree is 
completely equivalent to the information stored in the filtered matrix and, 
when the correlation is non-negative for each pair of elements, this matrix is 
positive definite (Tumminello et al., 2007d). 

Our example considers the correlation matrix of the daily returns of N = 10
stocks traded at the New York Stock Exchange during the time period from January
2001 to December 2003. The investigated stocks are AIG, IBM, BAC, AXP,
MER, TXN, SLB, MOT, RD, and OXY. The above presentation order of
stocks is given according to their market capitalization at December 2003.
In this paper, we indicate stocks with their tick symbols. The tick symbol,
company name and other information of each company are given in Table I.
From the Table we note that three stocks (OXY, RD, SLB) belong to the
energy sector, three (IBM, MOT, TXN) to the technology sector and four
(AIG, AXP, BAC, MER) to the financial sector.

The stock return correlation matrix computed by using T = 748 records is the
following

\[
C = \begin{pmatrix}
1.000 & 0.413 & 0.518 & 0.543 & 0.529 & 0.341 & 0.271 & 0.231 & 0.412 & 0.294 \\
      & 1.000 & 0.471 & 0.537 & 0.617 & 0.552 & 0.298 & 0.475 & 0.373 & 0.270 \\
      &       & 1.000 & 0.547 & 0.592 & 0.400 & 0.258 & 0.349 & 0.370 & 0.276 \\
      &       &       & 1.000 & 0.664 & 0.422 & 0.347 & 0.351 & 0.414 & 0.269 \\
      &       &       &       & 1.000 & 0.533 & 0.344 & 0.462 & 0.440 & 0.318 \\
      &       &       &       &       & 1.000 & 0.305 & 0.582 & 0.355 & 0.245 \\
      &       &       &       &       &       & 1.000 & 0.193 & 0.533 & 0.591 \\
      &       &       &       &       &       &       & 1.000 & 0.258 & 0.166 \\
      &       &       &       &       &       &       &       & 1.000 & 0.590 \\
      &       &       &       &       &       &       &       &       & 1.000
\end{pmatrix}
\tag{1}
\]

where the order of elements of the correlation matrix from left to right and
from top to bottom is the one based on capitalization given above.

A large number of hierarchical clustering procedures can be found in the
literature. For a review of the classical techniques see, for instance, Anderberg
(1973). In this paper we focus our attention on the single linkage cluster
analysis (SLCA) and average linkage cluster analysis (ALCA).



The starting point of both procedures is the empirical correlation matrix C.
The following procedure performs the ALCA, giving as an output a
hierarchical tree and a filtered correlation matrix $C^{<}_{ALCA}$:

(i) Set B = C.

(ii) Select the maximum correlation $b_{hk}$ in the correlation matrix B. Note
that h and k can be simple elements (i.e. clusters of one element each)
or clusters (sets of elements). $\forall i \in h$ and $\forall j \in k$ one sets the elements
$\rho^{<}_{ij}$ of the matrix $C^{<}_{ALCA}$ as $\rho^{<}_{ij} = \rho^{<}_{ji} = b_{hk}$.

(iii) Merge cluster h and cluster k into a single cluster, say q. The merging
operation identifies a node in the rooted tree connecting clusters h and k
at the correlation $b_{hk}$.

(iv) Redefine the matrix B:

\[
\begin{cases}
b_{qj} = \dfrac{n_h b_{hj} + n_k b_{kj}}{n_h + n_k} & \text{if } j \neq h \text{ and } j \neq k, \\
b_{ij} = b_{ij} & \text{otherwise,}
\end{cases}
\]

where $n_h$ and $n_k$ are the numbers of elements belonging respectively to
the cluster h and to the cluster k before the merging operation. Note that
if the dimension of B is m x m then the dimension of the redefined B is
(m - 1) x (m - 1) because of the merging of clusters h and k into the
cluster q.

(v) If the dimension of B is larger than 1 then go to step (ii), else stop.
By replacing point (iv) of the above algorithm with the following item

(iv)' Redefine the matrix B:

\[
\begin{cases}
b_{qj} = \mathrm{Max}[b_{hj}, b_{kj}] & \text{if } j \neq h \text{ and } j \neq k, \\
b_{ij} = b_{ij} & \text{otherwise,}
\end{cases}
\]

one obtains an algorithm performing the SLCA and the associated filtered
correlation matrix $C^{<}_{SLCA}$.
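
The two procedures can be illustrated with a minimal Python sketch, which uses only the standard library and is not the authors' implementation. It follows steps (i) to (v) literally, with step (iv) for the ALCA and step (iv)' for the SLCA. The names `UPPER`, `full_matrix` and `filtered_matrix` are assumptions introduced here for illustration; the numerical values are those of the sample correlation matrix of Eq. (1), in the same stock order.

```python
# Upper triangle of the sample correlation matrix of Eq. (1).
# Stock order: AIG, IBM, BAC, AXP, MER, TXN, SLB, MOT, RD, OXY.
UPPER = [
    [0.413, 0.518, 0.543, 0.529, 0.341, 0.271, 0.231, 0.412, 0.294],
    [0.471, 0.537, 0.617, 0.552, 0.298, 0.475, 0.373, 0.270],
    [0.547, 0.592, 0.400, 0.258, 0.349, 0.370, 0.276],
    [0.664, 0.422, 0.347, 0.351, 0.414, 0.269],
    [0.533, 0.344, 0.462, 0.440, 0.318],
    [0.305, 0.582, 0.355, 0.245],
    [0.193, 0.533, 0.591],
    [0.258, 0.166],
    [0.590],
]


def full_matrix(upper):
    """Build the symmetric matrix with unit diagonal from its upper triangle."""
    n = len(upper) + 1
    c = [[1.0] * n for _ in range(n)]
    for i, row in enumerate(upper):
        for offset, value in enumerate(row):
            j = i + 1 + offset
            c[i][j] = c[j][i] = value
    return c


def filtered_matrix(c, linkage="average"):
    """Filtered matrix for linkage 'average' (ALCA) or 'single' (SLCA)."""
    n = len(c)
    clusters = [[i] for i in range(n)]      # members of each current cluster
    b = [row[:] for row in c]               # step (i): B = C
    filtered = [[1.0] * n for _ in range(n)]
    while len(clusters) > 1:                # step (v): repeat until one cluster
        m = len(clusters)
        # Step (ii): most correlated pair of clusters (h, k).
        h, k = max(((i, j) for i in range(m) for j in range(i + 1, m)),
                   key=lambda pair: b[pair[0]][pair[1]])
        b_hk = b[h][k]
        for i in clusters[h]:
            for j in clusters[k]:
                filtered[i][j] = filtered[j][i] = b_hk
        # Step (iv) or (iv)': similarity between the merged cluster q and the others.
        n_h, n_k = len(clusters[h]), len(clusters[k])
        keep = [j for j in range(m) if j not in (h, k)]
        if linkage == "average":
            new_row = [(n_h * b[h][j] + n_k * b[k][j]) / (n_h + n_k) for j in keep]
        else:
            new_row = [max(b[h][j], b[k][j]) for j in keep]
        # Step (iii): merge h and k into q, stored as the last cluster.
        merged = clusters[h] + clusters[k]
        clusters = [clusters[j] for j in keep] + [merged]
        b = [[b[i][j] for j in keep] + [new_row[a]] for a, i in enumerate(keep)]
        b.append(new_row + [1.0])
    return filtered


C = full_matrix(UPPER)
C_ALCA = filtered_matrix(C, "average")
C_SLCA = filtered_matrix(C, "single")
```

Because the inputs are the three-decimal values printed in Eq. (1), the average linkage entries can differ from the published filtered matrix in the last digit. The single linkage entries are always elements of C itself, since step (iv)' only takes maxima.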

The hierarchical trees obtained from the sample correlation matrix of Eq. (1)
by applying the ALCA and the SLCA are given in Fig. 1 and in Fig. 2
respectively. A hierarchical tree is a rooted tree, i.e. a tree in which a special
node (the root) is singled out. In our example this node is $\alpha_1$. In the rooted
tree, we distinguish between leaves and internal nodes. Specifically, vertices of
degree 1 represent leaves (vertices labeled 1, 2, ..., 10 in Fig. 1) while vertices
of degree greater than 1 represent internal nodes (vertices labeled $\alpha_1$, $\alpha_2$, ...,
$\alpha_9$ in Fig. 1). The two trees are slightly different, showing that each clustering
method produces a different output, putting emphasis on different aspects of
the sample correlation matrix.

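The filtered correlation matrix associated with the ALCA is

\[
C^{<}_{ALCA} = \begin{pmatrix}
1.000 & 0.501 & 0.501 & 0.501 & 0.501 & 0.412 & 0.308 & 0.412 & 0.308 & 0.308 \\
      & 1.000 & 0.536 & 0.577 & 0.577 & 0.412 & 0.308 & 0.412 & 0.308 & 0.308 \\
      &       & 1.000 & 0.536 & 0.536 & 0.412 & 0.308 & 0.412 & 0.308 & 0.308 \\
      &       &       & 1.000 & 0.664 & 0.412 & 0.308 & 0.412 & 0.308 & 0.308 \\
      &       &       &       & 1.000 & 0.412 & 0.308 & 0.412 & 0.308 & 0.308 \\
      &       &       &       &       & 1.000 & 0.308 & 0.582 & 0.308 & 0.308 \\
      &       &       &       &       &       & 1.000 & 0.308 & 0.562 & 0.591 \\
      &       &       &       &       &       &       & 1.000 & 0.308 & 0.308 \\
      &       &       &       &       &       &       &       & 1.000 & 0.562 \\
      &       &       &       &       &       &       &       &       & 1.000
\end{pmatrix}
\tag{2}
\]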



whereas for the SLCA we obtain 
For the sake of comparison, here the ALCA and SLCA filtered correlation
matrices are both written with the same order of stocks as the sample correlation
matrix. By comparing the sample and the filtered matrices one immediately
notes that the filtered ones contain less information, being defined by a number
of distinct correlation coefficients equal to n - 1, whereas the original matrix
has n(n - 1)/2 distinct correlation coefficients. The two filtering methods
detect different information. In fact the ALCA uses the average correlation
coefficient between distinct groups of elements whereas the SLCA uses the
maximal correlation. The two choices filter correlation coefficients
characterized by a different degree of representativeness and statistical reliability.

It is worth noting that the hierarchical methods reveal the sectorial structure of 
the considered set of stocks. Specifically, in both cases the stocks belonging to 
the energy sector form a cluster. Fig. 1 shows that for the ALCA dendrogram
the node $\alpha_2$ splits the stocks in two sets, one composed of two technology
stocks and one composed of the financial stocks plus IBM. For the SLCA
dendrogram the separation of the set in the technology and financial subsectors
is less sharp (see Fig. 2). However in general hierarchical methods perform
quite well in identifying groups of stocks belonging to the same economic 
sector (Mantegna, 1999; Bonanno et al., 2001; Coronnello et al., 2005). 

In addition to the hierarchical trees and to the related filtered correlation ma- 
trices one can also obtain correlation based networks. Here we briefly recall 
how to select a correlation based graph out of the complete graph describ- 
ing the system. A complete graph is a graph with links connecting all the 
elements (or nodes in the graph terminology) of the system of interest. In 
correlation based networks a weight, which is monotonically related to the
correlation coefficient of each pair of elements, can be associated with each
link. Therefore one can immediately associate a weighted complete graph
with the correlation matrix among n elements of interest. A complete graph
is too rich in information and therefore a "filtering" (or "pruning") of it can
improve its readability. For this reason a procedure can be set to select a
subset of links which are highly informative about the hierarchical structure
of the system. By using clustering algorithms as filtering procedures a certain
number of correlation based graphs have been investigated in the econophysics 
literature. Correlation based networks which have been found very useful in 
the elucidation of economic properties of stock returns traded in a financial 
market are the minimum spanning tree (MST) (Mantegna, 1999), the planar 
maximally filtered graph (PMFG) (Tumminello et al., 2005) and the average 
linkage minimum spanning tree (ALMST) (Tumminello et al., 2007c). In the 
cited cases all the elements of the system are connected within the graph. 
Correlation based graphs with elements disconnected from a giant component 
can also be obtained starting from the correlation matrix. For example, an 
extension from trees to more general graphs generated by selecting the most 
correlated links has been proposed in Onnela et al. (2003). However, this last 
method selects only a subset of the investigated elements controlled by an
arbitrarily chosen threshold.

The MST is a correlation based tree associated with the SLCA. An illustrative 
algorithm providing the MST is the following. Let us first recall that the 
connected component of a graph g containing the vertex i is the maximal set of 
vertices Si (with i included) such that there exists a path in g between all pairs 
of vertices belonging to Si. When the element i has no links to other vertices 
then Si reduces just to the element i. The starting point of the procedure is 
an empty graph g with N vertices. The MST algorithm can be summarized 
in 6 steps: 

(i) Set Q as the matrix of elements $q_{ij}$ such that Q = C.

(ii) Select the maximum correlation $q_{hk}$ between elements belonging to
different connected components $S_h$ and $S_k$ in g. At the first step of the
algorithm connected components coincide with single vertices in g.

(iii) Find elements u, p such that $\rho_{up} = \mathrm{Max}\{\rho_{ij}, \forall i \in S_h \text{ and } \forall j \in S_k\}$.

(iv) Add to g the link between elements u and p with weight $\rho_{up}$. Once the
link is added to g, u and p will belong to the same connected component
$S = S_h \cup S_k$.

(v) Redefine the matrix Q:

\[
\begin{cases}
q_{ij} = q_{hk} & \text{if } i \in S_h \text{ and } j \in S_k, \\
q_{ij} = \mathrm{Max}\{q_{pt},\ p \in S \text{ and } t \in S_j, \text{ with } S_j \neq S\} & \text{if } i \in S \text{ and } j \in S_j, \\
q_{ij} = q_{ij} & \text{otherwise;}
\end{cases}
\]

(vi) If g is still a disconnected graph then go to step (ii), else stop.



The resulting graph g is the MST of the system and the matrix Q is the 
correlation matrix associated to the SLCA. The presented algorithm is not 
the most popular or the simplest algorithm for the construction of the MST 
but it clearly reveals the relation between SLCA and MST. Indeed connected 
components progressively merging together during the construction of g are 
nothing else but clusters progressively merging together in the SLCA. In Fig.
3 we show the MST associated with the considered example. It should be
noted that the correlation based tree contains more information than the hi- 
erarchical tree or the filtered correlation matrix. For example, the fact that 
the connection between the cluster of two technology stocks (MOT and TXN) 
and the cluster of mostly financial stocks (MER, AXP, AIG, BAC and IBM) 
occurs through IBM is something which is not contained in the hierarchical 
tree but it is present in the MST. 
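
As a hedged illustration, the link selection of steps (ii) to (iv) can be sketched in Python with a Kruskal-style loop. This is a swapped-in variant and not the matrix-update form given above: the links are visited in decreasing order of correlation, and a link is kept only when its endpoints lie in different connected components. The names `STOCKS`, `UPPER` and `minimum_spanning_tree` are assumptions introduced for this sketch; the values are those of Eq. (1).

```python
STOCKS = ["AIG", "IBM", "BAC", "AXP", "MER", "TXN", "SLB", "MOT", "RD", "OXY"]

# Upper triangle of the sample correlation matrix of Eq. (1), same stock order.
UPPER = [
    [0.413, 0.518, 0.543, 0.529, 0.341, 0.271, 0.231, 0.412, 0.294],
    [0.471, 0.537, 0.617, 0.552, 0.298, 0.475, 0.373, 0.270],
    [0.547, 0.592, 0.400, 0.258, 0.349, 0.370, 0.276],
    [0.664, 0.422, 0.347, 0.351, 0.414, 0.269],
    [0.533, 0.344, 0.462, 0.440, 0.318],
    [0.305, 0.582, 0.355, 0.245],
    [0.193, 0.533, 0.591],
    [0.258, 0.166],
    [0.590],
]


def minimum_spanning_tree(upper):
    """Return the MST links as (i, j, correlation), strongest links first."""
    n = len(upper) + 1
    links = [(value, i, i + 1 + offset)
             for i, row in enumerate(upper)
             for offset, value in enumerate(row)]
    links.sort(key=lambda link: -link[0])   # decreasing correlation
    parent = list(range(n))                 # union-find forest of components

    def find(i):
        # Root of the connected component containing vertex i.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    tree = []
    for value, i, j in links:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:                # endpoints in different components
            parent[root_i] = root_j         # merge the two components
            tree.append((i, j, value))
        if len(tree) == n - 1:
            break
    return tree


MST = minimum_spanning_tree(UPPER)
MST_NAMES = {frozenset((STOCKS[i], STOCKS[j])) for i, j, _ in MST}
```

For the matrix of Eq. (1) the sketch selects nine links, among them IBM-TXN and MER-RD, which agrees with the role of IBM and MER described in the text.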
Fig. 1. Average linkage cluster analysis. Illustrative example of a hierarchical tree
associated to a system of N = 10 stocks (tick symbols label stocks at the bottom
of the hierarchical tree; each element of the system is also labeled with an integer
number). The color of the line indicates the primary economic sector of the stock, red
for technology, blue for energy and green for financial. The labels of the nodes of
the hierarchical tree are used in the discussion of the hierarchically nested factor
model of Section 3.

By replacing the redefinition of Q with

\[
\begin{cases}
q_{ij} = q_{hk} & \text{if } i \in S_h \text{ and } j \in S_k, \\
q_{ij} = \mathrm{Mean}\{q_{pt},\ p \in S \text{ and } t \in S_j, \text{ with } S_j \neq S\} & \text{if } i \in S \text{ and } j \in S_j, \\
q_{ij} = q_{ij} & \text{otherwise;}
\end{cases}
\]

in step (v) of the above procedure one obtains an algorithm performing
the ALCA, and the final Q of the procedure is the corresponding correlation
matrix. The obtained tree g, which we termed ALMST (Tumminello et al.,
2007c), is a tree naturally associated with such a clustering procedure. The
choice of the link at step (iii) of the ALMST construction algorithm does not
affect the clustering procedure but specifies the construction of the correlation
based tree. More precisely, by selecting any link between nodes $u \in S_h$ and
$p \in S_k$ the matrix Q representing the result of ALCA remains the same in terms
of hierarchical tree. This degeneracy allows one to consider different rules to

[Figure 2: dendrogram; leaves from left to right: AXP, MER, IBM, BAC, TXN, MOT, AIG, SLB, OXY, RD.]

Fig. 2. Single linkage cluster analysis. Illustrative example of a hierarchical tree
associated to a system of N = 10 stocks (tick symbols label stocks at the bottom
of the hierarchical tree). The color of the line indicates the primary economic sector of
the stock, red for technology, blue for energy and green for financial.

select the link between elements u and p at step (iii) of the construction
algorithm. Different rules at step (iii) give rise to different correlation based 
trees. The same observation holds true for the algorithm that generates the 
MST. This fact implies that in principle one can consider spanning trees which 
are different from the MST and are still associated with the SLCA. However, 
we have already recalled that the MST is unique in the sense that, when an 
Euclidean distance is defined between links of the spanning tree, MST is the 
spanning tree of shortest length (West, 2001). 

For the present example the ALMST is essentially indistinguishable from the 
MST and for this reason we will not display it here. It is worth noting that 
whereas the hierarchical trees obtained with ALCA and SLCA show slight 
differences, these differences essentially disappear at the level of the associated
correlation based trees in the present example. 

Starting from the sample correlation matrix one can also obtain correlation 
based networks having a structure more complex than a tree. One such
correlation based network is the PMFG (Tumminello et al., 2005). This
correlation based network is associated with a hierarchical structure which is the one



Fig. 3. Minimum spanning tree associated with the SLCA of the example. The
vertices indicate the stocks. Colors indicate the different economic sectors, red for
technology, blue for energy and green for financial. The thickness of links is
proportional to the bootstrap percentage and this value is the number close to each
link.

given by SLCA but it presents a graph structure which is richer than the one 
of the MST. In fact, the PMFG has loops and cliques. A clique of k elements is 
a complete subgraph that links all k elements. Due to topological constraints, 
only cliques of 3 and 4 elements are allowed in the PMFG. To illustrate the 
PMFG algorithm, let us first consider a different construction algorithm for 
the MST. Following the ordered list $S_{ord}$ of correlation coefficients, starting
from the pair of elements with largest correlation, one adds a link between
element i and element j if and only if the graph obtained after the link
insertion is still a forest or it is a tree. A forest is a disconnected graph in which any
two elements are connected by at most one path, i.e. a disconnected ensemble
of trees. With this procedure, equivalent to the algorithm detailed above, the
graph obtained after all links of $S_{ord}$ are considered is the MST. In direct
analogy, Tumminello et al. (2005) introduce a correlation based graph obtained by
connecting elements with largest correlation under the topological constraint
of fixed genus G = 0. The genus is a topologically invariant property of a
surface defined as the largest number of nonintersecting simple closed curves
that can be drawn on the surface without separating it. Roughly speaking, it
is the number of holes in a surface. The construction algorithm for such a graph
is: following the ordered list $S_{ord}$, starting from the pair of elements with
largest correlation, one adds a link between element i and element j if and
only if the resulting graph can still be embedded on a plane or a sphere, i.e.
topological surfaces with G = 0. A basic difference of the PMFG with respect
to the MST is the number of links, which is N - 1 in the MST and 3(N - 2) in
the PMFG. Moreover, the PMFG is a network with loops whereas the MST
Fig. 4. Planar maximally filtered graph of the correlation matrix of the considered
example. The vertices indicate the stocks. Colors indicate the different economic 
sectors, red for technology, blue for energy and green for financial. The thickness of 
links is proportional to the bootstrap percentage and this value is the number close 
to each link. 

is a tree. It is worth recalling that Tumminello et al. (2005) have proven that 
the PMFG always contains the MST. 

In Fig. 4 we show the PMFG obtained for the considered example. In the
figure the length of the links is not related to the similarity measure between
the two vertices they connect. We are using this kind of representation to put
emphasis on the topological planarity of the network. In fact from the figure
it is evident that there are no crossings of links, and the entire network is
topologically embedded in a plane. By comparing Figs. 3 and 4 we note that
the stock MER, which turns out to be of central reference in the MST and
ALMST, is the only stock participating in all the seven 4-cliques which are
observed in the PMFG. In other words, the PMFG allows one to consider more
details present in the sample correlation matrix than those selected by the
MST or ALMST. For example the PMFG of Fig. 4 shows that two stocks of the
financial sector (AXP and MER) are connected to stocks of both the technology
and energy sector. Such a property was not present in the MST (or ALMST)
where only MER was linking two stocks of the two sectors (specifically RD
and IBM). The PMFG is therefore showing more details on the interrelations 
present among stocks than the MST. The PMFG has been recently used to 
investigate stock return multivariate time series in references Tumminello et 
al. (2005), Coronnello et al. (2005) and Tumminello et al. (2007e). 

The statistical reliability of regions of hierarchical trees and correlation based
graphs cannot be theoretically evaluated, in spite of the fact that the statistical
reliability of the spectral properties of the correlation matrix can be assessed
under the assumption of multivariate normal distribution for the time series
of the elements of the investigated set. In the absence of such a theoretical
approach we have devised a method to evaluate the statistical reliability of nodes
in a hierarchical tree obtained by using a correlation matrix as a similarity
measure and of links in a correlation based graph. The method we use is based
on a bootstrap procedure of the time series used to compute the correlation
matrix of the system. The method is detailed in Tumminello et al. (2007d) and
Tumminello et al. (2007c). Here we just sketch the most important aspects of
the procedure that associates a bootstrap value to each internal node of
a hierarchical tree. Consider a system of N time series of length T and
suppose the data are collected in a matrix X with N columns and T rows. A bootstrap
data matrix X* is formed by randomly sampling T rows from the original
data matrix X allowing multiple sampling of the same row. For each replica
X*, the associated correlation matrix C* is evaluated and a hierarchical tree
is constructed by hierarchical clustering. A large number (typically 1000) of
independent bootstrap replicas is considered and for each internal node of the
original data hierarchical tree we compute the fraction of bootstrap replicas
(commonly referred to as bootstrap value) preserving the internal node in the
hierarchical tree. Given an internal node $\alpha_k$ of the original hierarchical tree,
we say that a bootstrap replica preserves that node if and only if a node
in the replica hierarchical tree exists and identifies a branch characterized by
the same leaves identified by $\alpha_k$ in the original hierarchical tree. For instance,
we say that the node of the hierarchical tree in Fig. 1 joining leaves 1 to 5 is
preserved in some replica hierarchical tree D* if and only if a node of D* exists such that it
connects all and only the leaves 1, 2, 3, 4, and 5.
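
The node bootstrap just described can be sketched in a few lines of Python (standard library only). This is a hedged illustration under stated assumptions and not the code used for the paper: the function names, the synthetic two-factor data, the seeds and the reduced number of replicas (100 instead of 1000) are all choices made here for brevity. Each internal node is identified by the set of leaves it spans, and a replica preserves a node when its own tree contains a node with exactly the same leaf set.

```python
import random


def correlation_matrix(rows):
    """Pearson correlation matrix of the columns of a T x N list of rows."""
    t, n = len(rows), len(rows[0])
    columns = [[row[j] for row in rows] for j in range(n)]
    centered = []
    for column in columns:
        mean = sum(column) / t
        centered.append([x - mean for x in column])
    norms = [sum(x * x for x in column) ** 0.5 for column in centered]
    return [[sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
             for j in range(n)] for i in range(n)]


def alca_nodes(c):
    """Leaf sets (frozensets of column indices) of the internal nodes of the ALCA tree."""
    clusters = [frozenset([i]) for i in range(len(c))]
    sim = {frozenset((clusters[i], clusters[j])): c[i][j]
           for i in range(len(c)) for j in range(i + 1, len(c))}
    nodes = set()
    while len(clusters) > 1:
        h, k = tuple(max(sim, key=sim.get))     # most similar pair of clusters
        q = h | k
        nodes.add(q)
        clusters = [s for s in clusters if s != h and s != k]
        updated = {pair: v for pair, v in sim.items() if h not in pair and k not in pair}
        for s in clusters:                      # size-weighted average linkage
            updated[frozenset((q, s))] = (
                len(h) * sim[frozenset((h, s))] + len(k) * sim[frozenset((k, s))]
            ) / len(q)
        sim = updated
        clusters.append(q)
    return nodes


def bootstrap_values(rows, replicas=1000, seed=0):
    """Fraction of bootstrap replicas preserving each internal node of the original tree."""
    rng = random.Random(seed)
    original = alca_nodes(correlation_matrix(rows))
    counts = dict.fromkeys(original, 0)
    for _ in range(replicas):
        sample = rng.choices(rows, k=len(rows))  # T rows drawn with replacement
        replica_nodes = alca_nodes(correlation_matrix(sample))
        for node in original:
            counts[node] += node in replica_nodes
    return {node: count / replicas for node, count in counts.items()}


def synthetic_returns(t=200, seed=1):
    """Hypothetical data: two groups of three series, each driven by its own factor."""
    rng = random.Random(seed)
    rows = []
    for _ in range(t):
        f1, f2 = rng.gauss(0, 1), rng.gauss(0, 1)
        rows.append([f1 + rng.gauss(0, 0.5) for _ in range(3)]
                    + [f2 + rng.gauss(0, 0.5) for _ in range(3)])
    return rows


VALUES = bootstrap_values(synthetic_returns(), replicas=100)
```

The root spans all leaves in every replica, so its bootstrap value is 1 by construction; the informative values are those of the deeper nodes.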

In Fig. 5 we show the result of the application of the bootstrap procedure
to the ALCA hierarchical tree of the example shown in Section 2. From the
figure it is evident that different nodes have a different statistical reliability 
as quantified through the bootstrap value. 

An analogous uncertainty is observed and quantified in correlation based
graphs with a similar methodology. Consider a system of N elements and
suppose the data are collected in a matrix X with N columns and T rows. A
bootstrap data matrix X* is formed by randomly sampling T rows from the original
data matrix X allowing multiple sampling of the same row. For each replica
X*, the associated correlation matrix C* is evaluated and a correlation based
graph is constructed. By applying the procedures described in the previous
subsection to C one can construct, for example, the MST, the PMFG and
the ALMST of the system. The bootstrap technique requires the construction of a
number r of replicas $X^*_i$, i = 1, ..., r, of the data matrix X. Usually, r = 1000 is
considered a sufficient number of replicas. For each replica $X^*_i$ the correlation
matrix is evaluated and the correlation based graph of interest is obtained.
Fig. 5. Average linkage cluster analysis. Illustrative example of the estimation of
the bootstrap value associated with each node of the hierarchical tree. The color of
horizontal lines indicates the bootstrap value b of the node according to the following
color code: Green 0.4 < b < 0.6, Cyan 0.6 < b < 0.8, and Purple 0.8 < b < 1.0.



The result is a collection of correlation based graphs, for example, in the
case of MSTs, $\{MST^*_1, \ldots, MST^*_r\}$. To associate the so called
bootstrap value to a link of the original correlation based graph (in the
present example a MST) one counts the number of replicas $MST^*_i$ in which
the link appears and normalizes this number by the total number of replicas,
e.g. r = 1000. The bootstrap value gives information about the reliability
of each link of a correlation based graph. It is worth noting that the
bootstrap approach does not require knowledge of the data distribution and
is therefore particularly useful for high dimensional systems, where it is
difficult to infer the joint probability distribution from data. One might
be tempted to expect that the higher the correlation associated with a link
in a correlation based network, the higher the reliability of the link.
Tumminello et al. (2007c) show that this hypothesis is not always verified
in empirical results for sets of stocks traded in a financial market. The
bootstrap value and the correlation coefficient can be different, indicating
a different degree of stability with respect to metric and topological
aspects. In Figs. 3 and 4 the bootstrap values associated with each link are
reported as the number close to each link. For a detailed discussion about
the use of the bootstrap procedure to estimate the reliability of
correlation based graphs see Tumminello et al. (2007c).
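
The bootstrap value of each MST link can be estimated along the following
lines. This is a minimal sketch and not the implementation used in
Tumminello et al. (2007c): the function names `mst_links` and
`bootstrap_values` are hypothetical, the MST is obtained with Prim's
algorithm, and links are weighted with the distance
$d_{ij} = \sqrt{2(1 - \rho_{ij})}$, which is an assumed choice (any strictly
decreasing function of the correlation yields the same tree).

```python
import numpy as np

def mst_links(corr):
    """MST links (sorted index pairs) of the complete graph weighted by
    d_ij = sqrt(2 (1 - rho_ij)), computed with Prim's algorithm."""
    n = corr.shape[0]
    dist = np.sqrt(np.clip(2.0 * (1.0 - corr), 0.0, None))
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = dist[0].copy()               # cheapest distance from the tree to each node
    parent = np.zeros(n, dtype=int)     # tree node realizing that distance
    links = set()
    for _ in range(n - 1):
        j = int(np.argmin(np.where(in_tree, np.inf, best)))
        k = int(parent[j])
        links.add((min(j, k), max(j, k)))
        in_tree[j] = True
        closer = dist[j] < best
        best = np.where(closer, dist[j], best)
        parent = np.where(closer, j, parent)
    return links

def bootstrap_values(X, r=1000, seed=0):
    """Fraction of bootstrap replicas whose MST contains each link of the
    MST of the original data matrix X (T rows, N columns)."""
    rng = np.random.default_rng(seed)
    T = X.shape[0]
    original = mst_links(np.corrcoef(X, rowvar=False))
    counts = dict.fromkeys(original, 0)
    for _ in range(r):
        rows = rng.integers(0, T, size=T)   # sample T rows with replacement
        replica = mst_links(np.corrcoef(X[rows], rowvar=False))
        for link in original & replica:
            counts[link] += 1
    return {link: c / r for link, c in counts.items()}
```

The same loop applies to other correlation based graphs (e.g. the PMFG) by
replacing `mst_links` with the corresponding construction.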



3 A hierarchically nested factor model 

The filtering procedure of the correlation matrix provides filtered correlation 
matrices carrying information on the hierarchical structure of the investigated 
system. A hierarchical factor model associated with such a matrix is useful in 
the modeling of hierarchical complex systems. In this Section we discuss the 
hierarchically nested factor model (HNFM). The HNFM is introduced in Tum- 
minello et al. (2007d) in such a way that its correlation matrix coincides with 
the similarity matrix filtered by a chosen hierarchical clustering procedure. 

Hereafter, we illustrate the methodology to associate a nested factor model
to a multivariate data set. The association is done by retaining all the
information about the hierarchies detected by a hierarchical clustering.
This is achieved by considering a factor model in bijective relation with a
hierarchical tree. We present our method by making use of the illustrative
hierarchical tree given in Fig. 1. We first note that to each leaf $i$ (or
internal node $\alpha_h$) one can associate a genealogy $G(i)$
($G(\alpha_h)$ for the internal node $\alpha_h$), which is the ordered set
of internal nodes connecting leaf $i$ (internal node $\alpha_h$) to the root
$\alpha_1$. For instance the genealogy associated with the leaf 3 in the
figure is $G(3) = \{\alpha_6, \alpha_4, \alpha_3, \alpha_2, \alpha_1\}$ and
the genealogy of the internal node $\alpha_7$ in the figure is
$G(\alpha_7) = \{\alpha_7, \alpha_2, \alpha_1\}$. Note that the internal
node $\alpha_7$ is included in $G(\alpha_7)$. We say that an internal node
$w$ is the parent of the node $v$, and we use the notation $w = g(v)$, if
$w$ immediately precedes $v$ on the path from the root to $v$. For example
$\alpha_2 = g(\alpha_7)$ in Fig. 1. Besides the topological structure,
hierarchical trees obtained through clustering algorithms of the correlation
matrix also have metric properties. In fact clustering algorithms associate
to each internal node $\alpha_i$ a correlation coefficient
$\rho_{\alpha_i}$. Our internal node labeling implies that
$\rho_{\alpha_i} < \rho_{\alpha_{i+1}}$ and here we consider
$\rho_{\alpha_1} \ge 0$. In $C^<$ there are at most $N - 1$ distinct
coefficients (see discussion in Section 2). Exactly $N - 1$ distinct
coefficients are obtained in case of binary rooted trees.

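In Tumminello et al. (2007d) we introduce the factor model

\[ x_i(t) = \sum_{\alpha_h \in G(i)} \gamma_{\alpha_h} f^{(\alpha_h)}(t) + \eta_i \epsilon_i(t) \qquad (6) \]

where $i \in \{1, \ldots, N\}$,
$\eta_i = [1 - \sum_{\alpha_h \in G(i)} \gamma_{\alpha_h}^2]^{1/2}$, the
$h$-th factor $f^{(\alpha_h)}(t)$ and $\epsilon_i(t)$ are i.i.d. random
variables with zero mean and unit variance. By fixing the $\gamma$
parameters as

\[ \gamma_{\alpha_1} = \sqrt{\rho_{\alpha_1}}, \qquad \gamma_{\alpha_h} = \sqrt{\rho_{\alpha_h} - \rho_{g(\alpha_h)}} \quad \forall\, h = 2, \ldots, N - 1 \qquad (7) \]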



the model of Eq. (6) is the factor model characterized by a correlation
matrix equal to a given matrix $C^<$. It should be noted that by assuming
$\rho_{\alpha_1} \ge 0$, all the coefficients $\gamma_{\alpha_h}$ are non
negative real numbers. In Tumminello et al. (2007d) we prove that the
correlation $\rho^<_{ij} = \rho_{\alpha_k}$. In fact, the cross correlation
$\langle x_i x_j \rangle$ only depends on the factors $f^{(\alpha_h)}$ which
are common to $x_i$ and $x_j$. Since one associates a factor to each
internal node, one needs to identify the internal nodes belonging to both
the genealogies $G(i)$ and $G(j)$. One can verify that
$G(i) \cap G(j) = G(\alpha_k)$. For example, in Fig. 1 we have that
$G(5) = \{\alpha_3, \alpha_2, \alpha_1\}$ and
$G(6) = \{\alpha_7, \alpha_2, \alpha_1\}$ so that
$G(5) \cap G(6) = \{\alpha_2, \alpha_1\} = G(\alpha_2)$. By making use of
Eqs. (6) and (7) the cross correlation between variables $x_i$ and $x_j$ is

\[ \langle x_i x_j \rangle = \sum_{\alpha_h \in G(\alpha_k)} \gamma_{\alpha_h}^2 = \rho_{\alpha_k} = \rho^<_{ij}. \qquad (8) \]

For example with reference to Fig. 1 we have
$\langle x_5 x_6 \rangle = \gamma_{\alpha_2}^2 + \gamma_{\alpha_1}^2 = \rho_{\alpha_2} - \rho_{\alpha_1} + \rho_{\alpha_1} = \rho_{\alpha_2}$.
Thus the matrix $C^<$ is the correlation matrix associated with the factor
model of Eq. (6). It is worth noting that the existence of a factor model
whose correlation matrix is the matrix $C^<$ implies that the matrix $C^<$
is always positive definite if $\rho_{\alpha_1} \ge 0$.
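
The relation between genealogies, the $\gamma$ coefficients and the model
correlations can be checked numerically on a small example. The sketch below
is illustrative only: the tree, the node labels and the correlation values
are hypothetical (they are not those of Fig. 1), and the function names are
our own.

```python
import math

# Hypothetical rooted tree: each node maps to its parent g(.); "a1" is the root.
parent = {"a2": "a1", "a3": "a2", "a4": "a2",
          1: "a3", 2: "a3", 3: "a4", 4: "a4", 5: "a1"}
# Hypothetical correlation attached to each internal node (larger away from the root).
rho = {"a1": 0.1, "a2": 0.3, "a3": 0.6, "a4": 0.5}

def genealogy(node):
    """Ordered list of internal nodes from `node` up to the root.
    An internal node belongs to its own genealogy; a leaf does not."""
    path = [node] if node in rho else []
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def gamma(a):
    """sqrt(rho_a) for the root, sqrt(rho_a - rho_parent) for the other nodes."""
    if a not in parent:
        return math.sqrt(rho[a])
    return math.sqrt(rho[a] - rho[parent[a]])

def model_correlation(i, j):
    """Sum of gamma^2 over the internal nodes shared by leaves i and j."""
    shared = set(genealogy(i)) & set(genealogy(j))
    return sum(gamma(a) ** 2 for a in shared)

def eta(i):
    """Weight of the idiosyncratic term of leaf i."""
    return math.sqrt(1.0 - sum(gamma(a) ** 2 for a in genealogy(i)))

print(genealogy(1))              # ['a3', 'a2', 'a1']
print(model_correlation(1, 3))   # close to rho["a2"] = 0.3
```

In this example leaves 1 and 3 first meet at node `a2`, so the sum of the
squared coefficients along the shared genealogy telescopes to `rho["a2"]`,
up to floating point rounding.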

In the case in which negative correlations are associated with some nodes
in the dendrogram, it is sometimes possible to suitably modify Eqs. (7) by
introducing multiplicative sign variables in order to get an HNFM describing
the system. The description of the most general case is left for future
work. Here we just consider the case in which only $\rho_{\alpha_1} < 0$,
because this is the case in the empirical application described in
Section 4. Let us assume that all the correlations associated with nodes in
the dendrogram are non negative except $\rho_{\alpha_1} < 0$. Furthermore
assume that $|\rho_{\alpha_1}| < \rho_{\alpha_2}$ (this constraint is
satisfied in the empirical application of Section 4). In order to construct
the HNFM, we divide the elements of the system into two groups. These are
the two groups of elements merging together at the root node. The
coefficient $\gamma_{\alpha_1}$ shall be different for elements belonging to
different groups. Specifically,

\[ \gamma_{\alpha_1} = -\sqrt{|\rho_{\alpha_1}|} \ \text{for all the elements of the first group}, \qquad \gamma_{\alpha_1} = \sqrt{|\rho_{\alpha_1}|} \ \text{for all the elements of the second group}, \qquad (9) \]

whereas we do not need to distinguish among elements belonging to different
groups for the other $\gamma$ coefficients. Specifically

\[ \gamma_{\alpha_h} = \sqrt{\rho_{\alpha_h} - |\rho_{g(\alpha_h)}|} \quad \forall\, h = 2, \ldots, N - 1. \qquad (10) \]

We note that the constraint $|\rho_{\alpha_1}| < \rho_{\alpha_2}$ is
required by Eq. (10) in order to have real values of the $\gamma$
coefficients. Note also that $|\rho_{g(\alpha_h)}| = \rho_{g(\alpha_h)}$
$\forall\, g(\alpha_h) \neq \alpha_1$ and, accordingly, the $\gamma$
coefficients associated with all of the nodes different from the root node
and its children, as given in Eq. (10), coincide with the corresponding
coefficients as defined in Eq. (7).

Footnote: Here and in the following the notation $\langle x \rangle$
indicates the mean value of the variable $x$.
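
As a short consistency check, which is not spelled out in the text and is
given here only as a sketch under the assumptions stated above (only the
root correlation negative, and $|\rho_{\alpha_1}| < \rho_{\alpha_2}$), one
can verify that the sign choice reproduces the root correlation. If $x_i$
and $x_j$ belong to different groups they share only the root factor, with
coefficients of opposite sign, so that

\[ \langle x_i x_j \rangle = \left(-\sqrt{|\rho_{\alpha_1}|}\right) \left(\sqrt{|\rho_{\alpha_1}|}\right) = -|\rho_{\alpha_1}| = \rho_{\alpha_1}. \]

If instead $x_i$ and $x_j$ belong to the same group and first merge at the
internal node $\alpha_k \neq \alpha_1$, the root contributes
$\gamma_{\alpha_1}^2 = |\rho_{\alpha_1}|$ and the remaining shared nodes
contribute a telescoping sum,

\[ \langle x_i x_j \rangle = |\rho_{\alpha_1}| + \sum_{\alpha_h \in G(\alpha_k),\, h \ge 2} \left( \rho_{\alpha_h} - |\rho_{g(\alpha_h)}| \right) = |\rho_{\alpha_1}| + \rho_{\alpha_k} - |\rho_{\alpha_1}| = \rho_{\alpha_k}. \]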

Eq. (6) defines a HNFM of $N - 1$ factors obtained from a hierarchical tree
of $N$ elements. In general the number of factors determining the dynamics
of the system can be significantly smaller than $N - 1$. Moreover a
correlation coefficient matrix obtained from a finite multivariate time
series carries an unavoidable statistical uncertainty that might introduce
spurious factors. To overcome this problem, in Tumminello et al. (2007d) we
propose a method devised to select the HNFM characterized by the largest
number of factors (although in any case less than $N$) compatible with a
predefined threshold of statistical reliability of retained factors. The
method of Tumminello et al. (2007d) exploits the technique of non parametric
bootstrap (Efron, 1979; Efron and Tibshirani, 1994). The bootstrap technique
makes it possible to associate a bootstrap value to each internal node of a
hierarchical tree. Due to the one to one relation between nodes in the
hierarchical tree and factors in the HNFM, the bootstrap value associated
with a certain node of the hierarchical tree is associated also with the
corresponding factor in the HNFM.

Since the bootstrap value is a measure of the node's reliability, we propose
to remove those nodes, and therefore the corresponding factors, with
bootstrap value smaller than a given threshold h. This is done by merging
each node with a bootstrap value smaller than h with its first ancestor node
in the path to the root having a bootstrap value greater than h and then
constructing the HNFM associated with this reduced hierarchical tree. The
question is how to select the threshold h. The bootstrap value of a certain
node (factor) cannot be straightforwardly intended as the probability that
the node (factor) belongs to the true and unknown hierarchy (model) of the
system. For example, in phylogenetic analysis Hillis and Bull (1993) have
shown that bootstrap values of more than 70% correspond to a probability of
more than 95% that the true phylogeny has been found. In Tumminello et al.
(2007d) we do not choose a priori the value of h but we infer a suitable
value of the threshold from the data in a self consistent way. The detailed
procedure used to determine the threshold from data is discussed in
Tumminello et al. (2007d).

Fig. 6. Hierarchical tree of the set of daily equity returns of 100 highly
capitalized stocks traded at the NYSE during the period 2001-2003, obtained
by applying the average linkage clustering algorithm to the correlation
matrix. Colors are chosen according to the economic sector of each stock as
classified by Yahoo Finance. Specifically these sectors are Basic Materials
(violet), Consumer Cyclical (tan), Consumer Non Cyclical (yellow), Energy
(blue), Services (cyan), Financial (green), Healthcare (gray), Technology
(red), Utilities (magenta), Transportation (brown), Conglomerates (orange)
and Capital Goods (light green).
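
The node reduction described above, in which each unreliable node is merged
with its first reliable ancestor, can be sketched as follows. This is an
illustration under stated assumptions rather than the procedure of
Tumminello et al. (2007d): the function name `reduce_tree`, the dictionary
representation of the tree and the example values are hypothetical, the
root is always kept, and a node whose bootstrap value equals the threshold
is treated as reliable (the text does not specify this boundary case).

```python
def reduce_tree(parent, boot, threshold):
    """Collapse internal nodes whose bootstrap value is below `threshold`.

    parent: maps every non-root node (leaf or internal) to its parent.
    boot:   maps every internal node to its bootstrap value.
    Returns the parent map of the reduced tree.
    """
    def kept(node):
        # The root (absent from `parent`) is always kept.
        return node not in parent or boot[node] >= threshold

    def first_kept_ancestor(node):
        node = parent[node]
        while not kept(node):
            node = parent[node]
        return node

    return {node: first_kept_ancestor(node)
            for node in parent
            if node not in boot or kept(node)}   # leaves and reliable nodes

# Hypothetical tree with one unreliable internal node ("a3").
parent = {"a2": "a1", "a3": "a2", "a4": "a2",
          1: "a3", 2: "a3", 3: "a4", 4: "a4", 5: "a1"}
boot = {"a1": 1.0, "a2": 0.9, "a3": 0.5, "a4": 0.8}
reduced = reduce_tree(parent, boot, threshold=0.7)
print(reduced[1], reduced[2])   # a2 a2
```

Leaves 1 and 2 are reattached to `a2`, so the factor associated with `a3`
disappears from the HNFM built on the reduced tree.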

4 An empirical application: a set of stocks traded in a financial 
market 



As an application of the described technique to real data we examine the
set of daily equity returns of N = 100 highly capitalized stocks traded at
the NYSE during the period 2001-2003 (T = 748). Specifically, we apply the
ALCA to the correlation matrix of the system and we obtain the hierarchical
tree shown in Fig. 6. In the figure, the identity of each stock is labeled
by an integer number. The correspondence between each number and the ticker
symbol of the stock is provided in Table I. In the same Table we also
provide information about the company name and the economic sector and
sub-sector as classified by Yahoo Finance.



To evaluate the statistical robustness of each node and to simplify the
description in terms of a HNFM we use the bootstrap technique discussed
above. In particular, we select in a self-consistent way (see Tumminello et
al. (2007d)) the bootstrap value threshold, which turns out to be 70% for
the considered dataset. The corresponding reduced hierarchical tree has 27
nodes and it is reported in Fig. 7.

Fig. 7. Hierarchical tree with 27 internal nodes obtained by node reduction
of the ALCA hierarchical tree shown in Fig. 6. Rectangles at the bottom
indicate 8 clusters and the associated symbols label the classification of
stocks in terms of economic sectors or sub-sectors according to the
classification of Yahoo Finance (see text for the legend). Colors of lines
indicate stock sectors as in Fig. 6. The labeled internal nodes are
discussed in the text. In the figure we do not comment on clusters composed
by only two leaves.

Let us first comment on the properties of the reduced HNFM. In the figure we
observe several clusters and sub-clusters. As already noticed in previous
studies (Mantegna, 1999; Bonanno et al., 2001; Tumminello et al., 2005), the
detected clusters and sub-clusters overlap in part with economic
classifications such as, for example, the one provided by Yahoo Finance (as
of April 2005). This can be seen in Figs. 6 and 7, where we use this
classification to characterize each stock with a specific color. Most of the
groups detected by hierarchical clustering are characterized by the same
color. For example, financial firms are represented in Figs. 6 and 7 as
green lines in the hierarchical tree.

The root of the dendrogram of Fig. 7 is associated with a parameter
$\rho_{\alpha_1} = -0.004$. This value is negative even if it is not
statistically significantly different from zero given that the error
associated with the correlation coefficient is $1/\sqrt{T} = 0.036$. The
fact that $\rho_{\alpha_1} < 0$ requires the introduction of sign variables
as explained at the end of Section 3. Since $\rho_{\alpha_2} = 0.2 >
|\rho_{\alpha_1}|$, we can use Eqs. (9) and (10) to determine the parameters
of the model. The root node $\alpha_1$ splits the set of stocks in two
subsets, one composed by one stock (NEM, a gold mining company, see Table I)
and another composed by 99 stocks. The value of $\rho_{\alpha_1}$ is
consistent with the interpretation that NEM is uncorrelated with the rest of
the stocks. By using Eq. (9) we set $\gamma_{\alpha_1} = -0.063$ and
$\gamma_{\alpha_1} = +0.063$ for the two groups. The second factor (node)
$\alpha_2$ describes the market mean behavior and it is associated with the
parameter $\gamma_{\alpha_2} = 0.44$. The other 25 factors describe clusters
of stocks that are often significantly homogeneous with respect to the
sector activity of the stocks. In Fig. 7 we have highlighted 8 clusters by
using rectangles at the bottom of the figure. Specifically, F1 is the
sub-sector of investment services and F2 contains the sub-sectors of
regional banks and money center banks. Both F1 and F2 belong to the economic
sector Financial; T and E indicate the economic sectors Technology and
Energy respectively; H1 indicates the sub-sector major drugs of the economic
sector Healthcare; S1 and S2 indicate the two sub-sectors retail and
communication services of the sector Services respectively. Finally, X is a
cluster which is not homogeneous with respect to sector and sub-sector
classification. It comprises stocks in the sector of basic materials, stocks
of the sub-sector constructions of capital goods and stocks such as EMR
(classified as technology) and GM (classified as consumer cyclical).

One prominent example is the group of technology stocks (group T in
Fig. 7). The first two stocks (their ticker symbols are TXN and ADI) from
left to right of the group labeled as T in the reduced HNFM of Fig. 7 are
described by the equation

\[ x_i(t) = \gamma_{\alpha_{23}} f^{(\alpha_{23})}(t) + \gamma_{\alpha_4} f^{(\alpha_4)}(t) + \sum_{h=1}^{2} \gamma_{\alpha_h} f^{(\alpha_h)}(t) + \eta_i \epsilon_i(t) \]
\[ \phantom{x_i(t)} = 0.57 f^{(\alpha_{23})}(t) + 0.51 f^{(\alpha_4)}(t) + 0.44 f^{(\alpha_2)}(t) + 0.063 f^{(\alpha_1)}(t) + 0.47 \epsilon_i(t). \qquad (11) \]

The factors $f^{(\alpha_1)}(t)$ and $f^{(\alpha_2)}(t)$ are common to almost
all stocks whereas $f^{(\alpha_{23})}(t)$ and $f^{(\alpha_4)}(t)$ are
specific to these stocks. The other four technology stocks (which are EMC,
IBM, MOT and CA) are described by the equation

\[ x_i(t) = \gamma_{\alpha_4} f^{(\alpha_4)}(t) + \sum_{h=1}^{2} \gamma_{\alpha_h} f^{(\alpha_h)}(t) + \eta_i \epsilon_i(t) \]
\[ \phantom{x_i(t)} = 0.51 f^{(\alpha_4)}(t) + 0.44 f^{(\alpha_2)}(t) + 0.063 f^{(\alpha_1)}(t) + 0.74 \epsilon_i(t). \qquad (12) \]

In this last case only the $f^{(\alpha_4)}(t)$ factor is present in addition
to the $f^{(\alpha_1)}(t)$ and $f^{(\alpha_2)}(t)$ factors common to almost
all stocks. It is therefore natural to consider $f^{(\alpha_4)}(t)$ as a
factor characterizing technology stocks whereas $f^{(\alpha_{23})}(t)$ is an
additional factor further characterizing only the two stocks TXN and ADI. A
similar organization in nested clusters is observed in all the groups
detected by the reduced HNFM. The number of factors characterizing the
various stocks ranges from one to five.
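
The numerical coefficients quoted above can be cross-checked against one
another. The snippet below is only an arithmetic check, assuming the rounded
values reported in the text ($\rho_{\alpha_1} = -0.004$,
$\rho_{\alpha_2} = 0.2$ and the factor loadings of the technology stocks)
and the unit-variance condition on each $x_i$; the helper name `eta` is our
own.

```python
import math

def eta(gammas):
    """Idiosyncratic weight implied by unit variance: sqrt(1 - sum gamma^2)."""
    return math.sqrt(1.0 - sum(g * g for g in gammas))

rho_a1, rho_a2 = -0.004, 0.2
gamma_a1 = math.sqrt(abs(rho_a1))            # magnitude of the root coefficient
gamma_a2 = math.sqrt(rho_a2 - abs(rho_a1))   # market factor coefficient

eta_txn_adi = eta([0.57, 0.51, 0.44, 0.063])  # TXN and ADI (four factors)
eta_other_tech = eta([0.51, 0.44, 0.063])     # EMC, IBM, MOT and CA (three factors)

print(round(gamma_a1, 3), round(gamma_a2, 2))            # 0.063 0.44
print(round(eta_txn_adi, 2), round(eta_other_tech, 2))   # 0.47 0.74
```

The rounded results agree with the coefficients 0.063, 0.44, 0.47 and 0.74
reported in the text.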

It is worth comparing Figs. 6 and 7. The comparison shows that the
self-consistent reduction of the number of factors allows a robust
statistical validation of the groups that are detected from the data
analysis. Only the information which is statistically robust at the 95%
level is retained in the reduced HNFM. For example, the financial cluster
observed at the left end of the hierarchical tree in Fig. 6 is not robust at
the selected confidence level whereas the two sub-clusters indicated as F1
(LEH, BSC, MER and SCH) and F2 (NCC, STI, ONE, PNC, BAC, WFC, BK and MEL) in
Fig. 7 are. This empirical analysis has shown the usefulness of the HNFM in
an empirical investigation of hierarchically organized complex systems.



5 Information and stability of a correlation matrix via the Kullback- 
Leibler distance 

In Tumminello et al. (2007a) we propose to measure the performance of
filtering procedures by using the Kullback-Leibler distance introduced by
Kullback and Leibler (1951). The Kullback-Leibler distance (see, for
instance, Cover and Thomas (1991)) or mutual entropy is a measure of the
distance between two probability densities, say p and q. It is defined as

\[ K(p, q) = E_p\left[ \log\left( \frac{p}{q} \right) \right] \qquad (13) \]

where $E_p[\cdot]$ indicates the expectation value with respect to the
probability density p. The Kullback-Leibler distance is asymmetric. In
Eq. (13) the expectation value is evaluated according to the distribution p.

We consider first the Kullback-Leibler distance between multivariate
Gaussian random variables (Tumminello et al., 2007a). We consider variables
with zero mean and unit variance without loss of generality because we are
interested in the comparison of the correlation matrices of the two sets of
variables. In this case, the Gaussian multivariate distribution associated
with the random vector X is completely defined by the correlation matrix
$\Sigma$ of the system. In the following we indicate the probability density
function with $P(\Sigma, X)$. Given two different probability density
functions $P(\Sigma_1, X)$ and $P(\Sigma_2, X)$, we have

\[ K(P(\Sigma_1, X), P(\Sigma_2, X)) = E_{P(\Sigma_1, X)}\left[ \log \frac{P(\Sigma_1, X)}{P(\Sigma_2, X)} \right] = \int P(\Sigma_1, X) \log \frac{P(\Sigma_1, X)}{P(\Sigma_2, X)} \, dX. \qquad (14) \]

By performing the integral in Eq. (14) one obtains:

\[ K(P(\Sigma_1, X), P(\Sigma_2, X)) = \frac{1}{2} \left[ \log\left( \frac{|\Sigma_2|}{|\Sigma_1|} \right) + \mathrm{tr}\left( \Sigma_2^{-1} \Sigma_1 \right) - N \right], \qquad (15) \]



where N is the dimension of the space spanned by the X variable and
$|\Sigma|$ indicates the determinant of $\Sigma$. From now on we indicate
$K(P(\Sigma_1, X), P(\Sigma_2, X))$ simply with $K(\Sigma_1, \Sigma_2)$. It
is worth noting that the Kullback-Leibler distance takes naturally into
account the statistical nature of correlation matrices. Indeed
$K(\Sigma_1, \Sigma_2)$ is well defined only provided that the matrices
$\Sigma_1$ and $\Sigma_2$ are positive definite. This property is not common
to other measures of distance between matrices. However this property can
also be a limitation. The Kullback-Leibler distance cannot be used to
quantify the distance between positive semi-definite correlation matrices
that are observed, for example, when the length T of data series is smaller
than the number N of elements of the system.
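
The Gaussian Kullback-Leibler distance between two correlation matrices can
be evaluated numerically as in the following sketch. It assumes the closed
form $K(\Sigma_1, \Sigma_2) = \frac{1}{2}[\log(|\Sigma_2|/|\Sigma_1|) +
\mathrm{tr}(\Sigma_2^{-1}\Sigma_1) - N]$; the function name `kl_gaussian` is
our own, and the Cholesky factorization is used both to compute the log
determinants and to enforce the positive definiteness requirement discussed
above.

```python
import numpy as np

def kl_gaussian(sigma1, sigma2):
    """Kullback-Leibler distance K(sigma1, sigma2) between zero-mean
    Gaussians with positive definite matrices sigma1 and sigma2."""
    n = sigma1.shape[0]
    # np.linalg.cholesky raises LinAlgError unless the matrix is positive definite.
    l1 = np.linalg.cholesky(sigma1)
    l2 = np.linalg.cholesky(sigma2)
    logdet1 = 2.0 * np.sum(np.log(np.diag(l1)))
    logdet2 = 2.0 * np.sum(np.log(np.diag(l2)))
    trace_term = np.trace(np.linalg.solve(sigma2, sigma1))  # tr(sigma2^{-1} sigma1)
    return 0.5 * (logdet2 - logdet1 + trace_term - n)

a = np.array([[1.0, 0.5], [0.5, 1.0]])
b = np.eye(2)
print(kl_gaussian(a, b), kl_gaussian(b, a))   # the two values differ
```

The two printed values differ, which illustrates the asymmetry of the
distance.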

The Kullback-Leibler distance is also related to the maximum likelihood fac- 
tor analysis (Mardia et al., 1979). In fact, the log-likelihood function to be 
maximized in order to describe a system of N elements with sample correla- 
tion matrix C is a function of the Kullback-Leibler distance between the C 
and the model correlation matrix (see Tumminello et al. (2007a) for details). 

We have obtained the value of the Kullback-Leibler distance between two
multivariate distributions as a function of the two corresponding Pearson
correlation matrices. We are interested in the case in which one or both
correlation matrices are sample correlation matrices and thus are random
variables. Since different realizations of the process give rise to
different sample correlation matrices, a Kullback-Leibler distance having
one or two sample correlation matrices as arguments is a function of one or
two random matrices.

In the case of multinormally distributed variables, we consider a random
vector X of dimension N with a model correlation matrix $\Sigma$. Let $C_1$
and $C_2$ be two sample correlation matrices obtained from two independent
realizations of the system, both of length T. It is known that in this case
sample covariance matrices belong to the ensemble of Wishart random matrices
and many statistical properties of Wishart matrices are known (Mardia et
al., 1979). By making use of the theory of Wishart matrices, we obtained
(Tumminello et al., 2007a) that

\[ E[K(\Sigma, C_1)] = \frac{1}{2} \left\{ N \log\left( \frac{2}{T} \right) + \sum_{p=T-N+1}^{T} \frac{\Gamma'(p/2)}{\Gamma(p/2)} + \frac{N(N+1)}{T-N-1} \right\}, \qquad (16) \]

\[ E[K(C_1, \Sigma)] = \frac{1}{2} \left\{ N \log\left( \frac{T}{2} \right) - \sum_{p=T-N+1}^{T} \frac{\Gamma'(p/2)}{\Gamma(p/2)} \right\}, \qquad (17) \]

and

\[ E[K(C_1, C_2)] = \frac{1}{2} \frac{N(N+1)}{T-N-1}, \qquad (18) \]

where $\Gamma(x)$ is the usual Gamma function and $\Gamma'(x)$ is the
derivative of $\Gamma(x)$.



It is important to observe that all the expectation values given in
Eqs. (16)-(18) are independent of $\Sigma$, i.e. they are independent of the
specific model generating or describing the data. The independence property
implies that (i) the Kullback-Leibler distance is a good measure of the
statistical uncertainty of the correlation matrix which is due to the finite
length of data series and (ii) the expected value of the Kullback-Leibler
distance is known also when the underlying model hypothesized to describe
the system is unknown. This fact has important consequences. Suppose one
knows that the observed data are well approximated by a multivariate
Gaussian distribution and that one measures a sample correlation matrix C.
In order to remove some unavoidably present statistical uncertainty, the
observer applies a filtering procedure to the data obtaining the filtered
correlation matrix $C^{filt}$. If the filtering technique is able to recover
the model correlation matrix, i.e. $C^{filt} = \Sigma$, the Kullback-Leibler
distance $K(C, C^{filt})$ must be equal on average to the value given in
Eq. (17). This expected value is independent of the (unknown) model
correlation matrix $\Sigma$. Therefore large deviations from this
expectation value indicate that the filtered matrix is not consistent with
the true matrix of the system. If $K(C, C^{filt})$ is significantly smaller
than the expectation value of Eq. (17), the filtered matrix is keeping some
of the statistical uncertainty due to the finite length T. If, on the other
hand, $K(C, C^{filt})$ is significantly larger than the value of Eq. (17),
it means that the filtered matrix is either filtering too much information
or distorting the signal. The distance between $K(C, C^{filt})$ and the
expected value of Eq. (17) is a measure of the goodness of the filtering
procedure in keeping the maximal amount of information which can be present
in sample correlation matrices estimated with a finite number of records.
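
The model independent reference values can be evaluated numerically as in
the sketch below. It assumes that the three Wishart expectation values take
the form written in the code comments, with
$\psi(x) = \Gamma'(x)/\Gamma(x)$; the function names are our own, and the
digamma function is approximated with the standard recurrence and asymptotic
series so that only the Python standard library is needed.

```python
import math

def digamma(x):
    """psi(x) = Gamma'(x) / Gamma(x) for x > 0 (recurrence + asymptotic series)."""
    result = 0.0
    while x < 6.0:                 # psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series

def expected_kl(n, t):
    """Expected Kullback-Leibler distances for n variables and t records.

    Returns (E[K(Sigma, C1)], E[K(C1, Sigma)], E[K(C1, C2)]), assuming
      E[K(Sigma, C1)] = 0.5 * (n*log(2/t) + sum_psi + n*(n+1)/(t-n-1))
      E[K(C1, Sigma)] = 0.5 * (n*log(t/2) - sum_psi)
      E[K(C1, C2)]    = 0.5 * n*(n+1)/(t-n-1)
    where sum_psi = sum of psi(p/2) for p = t-n+1, ..., t.
    """
    if t <= n + 1:
        raise ValueError("the formulas require t > n + 1")
    sum_psi = sum(digamma(p / 2.0) for p in range(t - n + 1, t + 1))
    boundary = n * (n + 1) / (t - n - 1)
    e_model_sample = 0.5 * (n * math.log(2.0 / t) + sum_psi + boundary)
    e_sample_model = 0.5 * (n * math.log(t / 2.0) - sum_psi)
    e_sample_sample = 0.5 * boundary
    return e_model_sample, e_sample_model, e_sample_sample

# Sizes of the empirical application: N = 100 stocks, T = 748 daily records.
print(expected_kl(100, 748))
```

A filtered matrix can then be compared with the second returned value, and
the reproducibility of two filtered matrices with the third one.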



Footnote: In this and in the following sections we use the superscript to
indicate the filtering procedure, whereas in Section 2 we used the
subscript.

A second aspect concerns the stability of the filtered correlation matrix
obtained from a sample matrix. Let us suppose we apply a certain filtering
procedure to the correlation matrices $C_1$ and $C_2$ of two independent
realizations of the system, obtaining two filtered correlation matrices
$C_1^{filt}$ and $C_2^{filt}$. If it turns out that
$K(C_1^{filt}, C_2^{filt})$ is larger than the expected value of
$K(C_1, C_2)$ described by Eq. (18), one can conclude that the filtering
procedure produces correlation matrices less reproducible than the sample
correlation matrices and therefore the procedure is not suitable for the
purpose of filtering robust information from the empirical correlation
matrices $C_1$ and $C_2$.

In summary, in Tumminello et al. (2007a) we have shown that the
Kullback-Leibler distance is well suited for comparing correlation matrices.
Its main properties are that (i) it is an asymmetric distance and therefore
it can distinguish between quantities observed in real systems and
quantities used to model the empirical observations, e.g. the sample
correlation matrix and the filtered correlation matrix respectively, and
(ii) the expectation values of the Kullback-Leibler distance given in
Eqs. (16)-(18) are model independent, indicating that this distance is a
good estimator of the statistical uncertainty due to the finite size of the
empirical sample. These properties are not observed in other widespread
distances between matrices. For example, Tumminello et al. (2007a) have
shown that these properties are not observed for the Frobenius distance,
which is a standard measure of the distance between matrices.

Biroli et al. (2007) have extended the above results on the Kullback-Leibler
distance to a general class of elliptic distributions. Specifically, they
considered random variables $x_i$ ($i = 1, \ldots, N$) that can be generated
by starting from random variables $y_i$ ($i = 1, \ldots, N$) following a
generic multivariate distribution and by setting $x_i = s\, y_i$, where $s$
is a positive random variable.

As a specific example, which is also relevant for financial data, they
considered the multivariate Student's t-distribution. In this case the
variables $y_i$ are taken from a multivariate normal distribution with
correlation matrix $\Sigma$ and $s$ is distributed according to

\[ P(s) = \frac{2}{\Gamma(\mu/2)} \exp\left[ -\frac{s_0^2}{s^2} \right] \frac{s_0^{\mu}}{s^{1+\mu}} \qquad (19) \]



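where $s_0^2 = 2\mu/(\mu - 2)$ in such a way that $s$ has unit variance. The
joint probability density function for the $x_i$ is

\[ P(x_1, x_2, \ldots, x_N) = \frac{\Gamma\left( \frac{N+\mu}{2} \right)}{\Gamma(\mu/2) \sqrt{(\mu\pi)^N |\Sigma|}} \, \frac{1}{\left( 1 + \frac{1}{\mu} \sum_{ij} x_i (\Sigma^{-1})_{ij} x_j \right)^{\frac{N+\mu}{2}}} \qquad (20) \]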



where the parameter $\mu$ gives the degrees of freedom of the distribution
and describes the tail behavior of the marginal distribution of any $x_i$,
since $P(x_i) \sim x_i^{-1-\mu}$.

Let us assume that the correlation matrix is computed with the Pearson
estimator for correlation coefficients. Biroli et al. (2007) show that the
Kullback-Leibler distance between two multivariate Student's t-distributions
with the same scaling parameter $\mu$ and correlation matrices $\Sigma_1$
and $\Sigma_2$ is

\[ K(\Sigma_1, \Sigma_2) = \frac{1}{2} \left\{ \log\left( \frac{|\Sigma_2|}{|\Sigma_1|} \right) + (N + \mu) \int ds\, P(s) \log\left[ \frac{1 + \mathrm{tr}\left( \Sigma_2^{-1} \Sigma_1 \right)/(2s)}{1 + N/(2s)} \right] \right\}. \qquad (21) \]



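In the limit $\mu/N \to \infty$ this expression coincides with the one
obtained for the Gaussian case above, whereas in the limit $\mu/N \to 0$
Biroli et al. (2007) obtain

\[ K(\Sigma_1, \Sigma_2) = \frac{1}{2} \log\left( \frac{|\Sigma_2|}{|\Sigma_1|} \right) + \frac{N}{2} \log\left[ \frac{1}{N} \mathrm{tr}\left( \Sigma_2^{-1} \Sigma_1 \right) \right]. \qquad (22) \]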



Now consider a random vector X of dimension N with correlation matrix
$\Sigma$ and distributed according to Eq. (20). Let $C_1$ and $C_2$ be two
sample correlation matrices obtained from two independent realizations of
the system, both of length T. It is possible to show (Biroli et al., 2007)
that, similarly to the Gaussian case, the expectation values
$E[K(\Sigma, C_1)]$, $E[K(C_1, \Sigma)]$ and $E[K(C_1, C_2)]$, where the
Kullback-Leibler distance is calculated according to either Eq. (21) or
Eq. (22), do not depend on $\Sigma$, i.e. the expectation values are model
independent. This important property is valid not only for Gaussian
multivariate variables but holds true in general for elliptic multivariate
distributions. This is due to the fact that for generic multivariate
distributions the Kullback-Leibler distance can be written as
$K(\Sigma_1, \Sigma_2) = \mathrm{tr}[f(\Sigma_2^{-1} \Sigma_1)]$, where $f$
is a function independent of the correlation structure of the system (Biroli
et al., 2007).



It is interesting to compare the expressions of the Kullback-Leibler
distance for Gaussian and Student's variables. Specifically, we can compare
Eq. (15) and Eq. (22). The right hand side of both equations is the sum of
two terms. The first term, $\frac{1}{2} \log(|\Sigma_2|/|\Sigma_1|)$, is the
same for both equations. Let us then focus our attention on the second terms
of the equations and, in particular, on the second term of Eq. (22), which
is $\frac{N}{2} \log[\frac{1}{N} \mathrm{tr}(\Sigma_2^{-1} \Sigma_1)]$.
Suppose that the correlation matrix $\Sigma_1$ is very close to $\Sigma_2$,
i.e. $\Sigma_1 \simeq \Sigma_2$. Then, under this assumption,
$\frac{1}{N} \mathrm{tr}(\Sigma_2^{-1} \Sigma_1) \simeq 1$ and, at first
order in $\frac{1}{N} \mathrm{tr}(\Sigma_2^{-1} \Sigma_1) - 1$, we have

\[ \frac{N}{2} \log\left[ \frac{1}{N} \mathrm{tr}\left( \Sigma_2^{-1} \Sigma_1 \right) \right] \simeq \frac{N}{2} \left[ \frac{1}{N} \mathrm{tr}\left( \Sigma_2^{-1} \Sigma_1 \right) - 1 \right] = \frac{1}{2} \left[ \mathrm{tr}\left( \Sigma_2^{-1} \Sigma_1 \right) - N \right] \qquad (23) \]

which is exactly the second term of the right hand side of Eq. (15). This
calculation shows that whenever the correlation matrices involved in the
Kullback-Leibler distance are very close to each other
($\Sigma_1 \simeq \Sigma_2$), the expression of the Kullback-Leibler
distance for Student's t-distributions with small $\mu$ coincides, at the
first order of approximation, with the expression of the Kullback-Leibler
distance obtained for Gaussian random variables.

As a final remark we note that there is a way to compute the expected value
of a Kullback-Leibler distance which does not make use of the Pearson
estimator. It is known that the Pearson estimator of the correlation matrix
is not the maximum likelihood estimator when the variables are non-Gaussian.
In the case of the Student's t-distribution of Eq. (20) there exists a
recursive equation for the maximum likelihood estimator $\hat{C}$ which is
(Bouchaud and Potters, 2003)

\[ \hat{C}_{ij} = \frac{N + \mu}{T} \sum_{t=1}^{T} \frac{x_i(t)\, x_j(t)}{\mu + \sum_{mn} x_m(t) (\hat{C}^{-1})_{mn} x_n(t)}. \qquad (24) \]

Since the maximum likelihood estimator of Eq. (24) of the correlation matrix
follows a Wishart distribution in the large N limit, the expectation values
above can be calculated by making use of Wishart theory also in the case of
Student's variables (Biroli et al., 2007; Tumminello et al., 2007b).
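
A recursive equation of this kind is naturally solved by fixed point
iteration. The sketch below assumes the standard form of the Student's t
maximum likelihood condition with known $\mu$, namely
$\hat{C} = \frac{N+\mu}{T} \sum_t x(t) x(t)^{\top} / (\mu + x(t)^{\top}
\hat{C}^{-1} x(t))$; the function name `student_mle`, the starting point and
the stopping rule are our own choices and are not taken from the cited
references.

```python
import numpy as np

def student_mle(X, mu, n_iter=200, tol=1e-10):
    """Fixed point iteration for the Student's t maximum likelihood estimator.

    X: data matrix with T rows (records) and N columns (variables).
    mu: degrees of freedom, assumed known.
    """
    t, n = X.shape
    c = np.cov(X, rowvar=False, bias=True)        # starting point of the iteration
    for _ in range(n_iter):
        # q[t] = x(t)^T C^{-1} x(t) for every record
        q = np.einsum("ti,ij,tj->t", X, np.linalg.inv(c), X)
        weights = (n + mu) / (mu + q)             # down-weights extreme records
        c_new = (X * weights[:, None]).T @ X / t
        if np.max(np.abs(c_new - c)) < tol:
            return c_new
        c = c_new
    return c
```

Records with a large value of $x(t)^{\top} \hat{C}^{-1} x(t)$ receive a
small weight, which is what makes the estimator less sensitive to fat tails
than the Pearson estimator.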

Finally, it is worth noting that there exists a straightforward modification
of the HNFM of Eq. (6) to generate hierarchically organized random variables
with the multivariate Student's t-distribution. Specifically, we consider



6 Comparison of filtering procedures 

The KL distance can therefore be used to quantify and compare the perfor- 
mance of different filtering procedures of correlation matrices (Tumminello et 
al., 2007a). A good filtering procedure should have two important properties: 
(i) being able to remove the "right" amount of noise from the data in order to 
recover the signal and (ii) produce filtered matrices which are stable when one 
makes different observations of the same system. These two requirements are 
often in competition one with the other. The proposed procedure to evaluate 
the performance of a filtering procedure is the following. 

Suppose we are given a data sample X and we have our favorite filtering
procedure. We generate M bootstrap replicas $X_i$ ($i = 1, \ldots, M$) of
the data. For each replica $X_i$ we then compute the sample correlation
matrix $C_i$ and apply the filtering procedure, obtaining the filtered
matrix $C_i^{filt}$. In order to measure the stability of the filtering
procedure, we consider the average of the quantity
$K(C_i^{filt}, C_j^{filt})$ over the replicas. An optimal filtering
procedure should be perfectly stable (i.e.
$\langle K(C_i^{filt}, C_j^{filt}) \rangle = 0$) because from each
realization the filtering recovers the model matrix. In order to measure the
filtered information we consider the average of $K(C_i, C_i^{filt})$ over
the replicas. This quantity measures the information present in the sample
correlation matrix $C_i$ that has been discarded by the filtering procedure.
We have seen above that for Gaussian variables the KL distance
$\langle K(C_i, \Sigma) \rangle$ is different from zero and independent of
the model $\Sigma$ (see Eq. (17)). Therefore if our filtering procedure is
recovering the true underlying model we should expect that
$\langle K(C_i, C_i^{filt}) \rangle$ is equal to the right hand side of
Eq. (17). We have thus a reference value for both the stability and the
information expected from an optimal filtering and these values are
independent of the underlying model. We represent the result of the analysis
in a plane where the x axis reports the stability
$\langle K(C_i^{filt}, C_j^{filt}) \rangle$ and the y axis reports the
information $\langle K(C_i, C_i^{filt}) \rangle$. In this plane the optimal
point, labeled $\Sigma$, has coordinate x = 0 and y equal to the right hand
side of Eq. (17). A filtering procedure will be considered good if the
corresponding point in the stability-information plane is close to $\Sigma$.
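
The evaluation procedure just described can be sketched as follows. The
function names are hypothetical, the Gaussian form of the Kullback-Leibler
distance is assumed, and the stability is averaged over all ordered pairs of
distinct replicas, which is one possible reading of "the average over the
replicas". The filter is passed as a function mapping a sample correlation
matrix to a filtered one, and the number of records must exceed the number
of variables so that the sample matrices are positive definite.

```python
import numpy as np

def kl(s1, s2):
    """Gaussian Kullback-Leibler distance K(s1, s2) (positive definite inputs)."""
    n = s1.shape[0]
    logdet1 = np.linalg.slogdet(s1)[1]
    logdet2 = np.linalg.slogdet(s2)[1]
    return 0.5 * (logdet2 - logdet1 + np.trace(np.linalg.solve(s2, s1)) - n)

def stability_information(X, filt, m=20, seed=0):
    """Coordinates (stability, information) of a filtering procedure `filt`.

    stability:   mean of K(C_i^filt, C_j^filt) over pairs of replicas, i != j
    information: mean of K(C_i, C_i^filt) over replicas
    """
    rng = np.random.default_rng(seed)
    t = X.shape[0]
    samples, filtered = [], []
    for _ in range(m):
        rows = rng.integers(0, t, size=t)          # bootstrap replica of the data
        c = np.corrcoef(X[rows], rowvar=False)
        samples.append(c)
        filtered.append(filt(c))
    information = float(np.mean([kl(c, f) for c, f in zip(samples, filtered)]))
    stability = float(np.mean([kl(filtered[i], filtered[j])
                               for i in range(m) for j in range(m) if i != j]))
    return stability, information
```

Two limiting cases help to read the plane: a filter that returns the sample
matrix unchanged has zero information coordinate but a positive stability
coordinate, whereas a filter that always returns the same matrix is
perfectly stable regardless of how much information it discards.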

To provide representative examples of quantitative analysis with the
Kullback-Leibler distance of filtering procedures based on hierarchical
clustering, as in the other sections, we consider SLCA and ALCA. We also
consider filtering procedures based on the shrinkage technique (Ledoit and
Wolf, 2003) and on random matrix theory (RMT) (Laloux et al., 1999; Plerou
et al., 1999; Rosenow et al., 2002; Potters et al., 2005). For the case of
the shrinkage procedure we construct a filtered matrix as

\[ C^{SHR}(\alpha) = \alpha T + (1 - \alpha) C, \qquad (26) \]

where $0 \le \alpha \le 1$ and T is a target matrix. As commonly done in the
financial literature, we choose the target matrix as a matrix with
$t_{ii} = 1$ and $t_{ij} = \langle c_{ij} \rangle$ for $i \neq j$. We
estimate the performance of the shrinkage procedure for different values of
$\alpha$. It is interesting to note that there exist analytical methods to
obtain the optimal value $\alpha^*$ according to a cost function based on
the standard quadratic (or Frobenius) norm (Schäfer and Strimmer, 2005). In
the figures we also show the point (labeled $C^{SHR}(\alpha^*)$)
corresponding to the value $\alpha^*$.

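A minimal sketch of the shrinkage filter of Eq. (26) with the constant-correlation target
described above may help. The function name and the toy data are assumptions made for
illustration; the target uses the average of the off-diagonal sample correlations.

```python
import numpy as np

def shrinkage_filter(c, alpha):
    """Return alpha * T + (1 - alpha) * c, with t_ii = 1 and t_ij = <c_ij> for i != j."""
    n = c.shape[0]
    off_diagonal = ~np.eye(n, dtype=bool)
    target = np.full((n, n), c[off_diagonal].mean())  # average off-diagonal correlation
    np.fill_diagonal(target, 1.0)
    return alpha * target + (1.0 - alpha) * c

# Toy sample correlation matrix (illustrative data, not the data of the paper).
rng = np.random.default_rng(0)
c_sample = np.corrcoef(rng.standard_normal((300, 6)), rowvar=False)

c_no_shrink = shrinkage_filter(c_sample, 0.0)    # alpha = 0 returns the sample matrix
c_full_shrink = shrinkage_filter(c_sample, 1.0)  # alpha = 1 returns the target matrix
c_half = shrinkage_filter(c_sample, 0.5)
```

Since both $T$ and $C$ have unit diagonal, the filtered matrix is again a correlation matrix
for every value of $\alpha$ between 0 and 1.
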
Fig. 8. Stability of the filtered matrix (x axis) against the amount of information
about the correlation matrix that is retained in the filtered matrix (y axis). The
points labeled with $C^{\mathrm{SHR}}(\alpha)$ correspond to the shrinkage procedure (see Eq. (26))
and the parameter $\alpha$ goes from 0 to 1 when one goes from the bottom right to the
top left corner. The left panel shows the result for a block diagonal model of N = 100
elements divided in 12 groups and simulated for T = 748 points. The right panel shows
the result for a hierarchically nested model of 100 elements following the HNFM
with 23 factors of Tumminello et al. (2007d).

In the econophysics literature, a widespread filtering procedure of the correlation
matrix is based on random matrix theory (Mehta, 1990). If the
variables are independent and have finite variance, then in the limit $T, N \to \infty$,
with a fixed ratio $Q = T/N > 1$, the eigenvalues of the Pearson sample correlation
matrix $C$ are bounded from above by the value $\lambda_{\max} = \sigma^2 (1 + 1/Q + 2\sqrt{1/Q})$,
where $\sigma^2 = 1$ for correlation matrices. In some practical cases, such as for
example in finance, one finds that the largest eigenvalue $\lambda_1$ of the empirical
correlation matrix is definitely inconsistent with RMT. In these cases,
Laloux et al. (1999) propose to modify the null hypothesis so that correlations
can be explained in terms of a one factor model and $\sigma^2 = 1 - \lambda_1/N$.
The filtering procedure considered here has been proposed by Potters et al.
(2005) and it works as follows. One diagonalizes the correlation matrix and
replaces all the eigenvalues smaller than $\lambda_{\max}$ in the diagonal matrix with
their average value, so that the trace is preserved. Then one transforms the
modified diagonal matrix back to the standard basis, obtaining a matrix
$H^{\mathrm{RMT}}$ of elements $h_{ij}^{\mathrm{RMT}}$. Finally, the filtered correlation matrix $C^{\mathrm{RMT}}$
is the matrix of elements

$$c_{ij}^{\mathrm{RMT}} = \frac{h_{ij}^{\mathrm{RMT}}}{\sqrt{h_{ii}^{\mathrm{RMT}}\, h_{jj}^{\mathrm{RMT}}}}.$$

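The RMT filtering steps described above can be sketched as follows. This is an illustrative
reading of the procedure, assuming the one factor null hypothesis with $\sigma^2 = 1 - \lambda_1/N$;
the function name and the one factor toy data are assumptions and not the authors' code.

```python
import numpy as np

def rmt_filter(c, n_records):
    """Filter a sample correlation matrix c estimated from n_records observations."""
    n = c.shape[0]
    q = n_records / n
    eigval, eigvec = np.linalg.eigh(c)           # eigenvalues in ascending order
    sigma2 = 1.0 - eigval[-1] / n                # one factor null hypothesis
    lam_max = sigma2 * (1.0 + 1.0 / q + 2.0 * np.sqrt(1.0 / q))
    small = eigval < lam_max
    filtered = eigval.copy()
    if small.any():
        filtered[small] = eigval[small].mean()   # averaging preserves the trace
    h = (eigvec * filtered) @ eigvec.T           # back to the standard basis
    d = np.sqrt(np.diag(h))
    return h / np.outer(d, d)                    # rescale to unit diagonal

# Toy data: 20 variables driven by one common factor (illustrative assumption).
rng = np.random.default_rng(0)
n_records, n_elements = 200, 20
factor = rng.standard_normal((n_records, 1))
x = 0.5 * factor + np.sqrt(0.75) * rng.standard_normal((n_records, n_elements))
c_sample = np.corrcoef(x, rowvar=False)
c_rmt = rmt_filter(c_sample, n_records)
```

The final rescaling is needed because replacing the small eigenvalues changes the diagonal
elements of the matrix, which are no longer exactly equal to one.
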
In Fig. 8 we show the KL distance in the stability-information plane for these
filtering procedures applied to artificial data generated according to two
different models. The left panel shows the result for a model whose correlation
matrix is block diagonal with 12 blocks, whereas the right panel shows the
result for a HNFM with 23 factors. In both panels we show the points
corresponding to the RMT, SLCA, and ALCA filtering procedures. We also show
the points corresponding to the shrinkage filtering procedure of Eq. (26) for
different values of $\alpha$. The shrinkage method is able to achieve a very good
compromise between stability and information. From this analysis it is
possible to extract an optimal value of $\alpha$ minimizing the distance from the point
labeled with $\Sigma$. It should be noted that this value in general does not coincide
with the value $\alpha^*$ obtained with the method of Schäfer and Strimmer (2005),
i.e. by minimizing the Frobenius norm.

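The selection of the shrinkage parameter closest to the optimal point can be sketched in a few
lines. The function name is an assumption, and the numbers in the usage example are purely
hypothetical placeholders, not values read from the figures.

```python
import numpy as np

def closest_to_reference(alphas, stability, information, information_ref):
    """Return the alpha whose (stability, information) point is nearest to (0, information_ref)."""
    alphas = np.asarray(alphas, dtype=float)
    dx = np.asarray(stability, dtype=float)                    # distance from x = 0
    dy = np.asarray(information, dtype=float) - information_ref
    distance = np.hypot(dx, dy)                                # Euclidean distance in the plane
    k = int(np.argmin(distance))
    return alphas[k], distance[k]

# Hypothetical curve: stability improves and discarded information grows with alpha.
alphas = [0.0, 0.5, 1.0]
stability = [3.0, 1.0, 0.0]
information = [0.0, 5.0, 20.0]
best_alpha, best_distance = closest_to_reference(alphas, stability, information, 6.0)
```

In practice the curve would be evaluated on a finer grid of $\alpha$ values, with both coordinates
obtained as averages of the Kullback-Leibler distance over the replicas.
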
Fig. 9. Stability of the filtered matrix (x axis) against the amount of information
about the correlation matrix that is retained in the filtered matrix (y axis) for
N = 100 stocks of the NYSE in the period 2001-2003 (T = 748). The points labeled
with $C^{\mathrm{SHR}}(\alpha)$ correspond to the shrinkage procedure (see Eq. (26)) and the parameter
$\alpha$ goes from 0 to 1 when one goes from the bottom right to the top left corner. The
point $C^{\mathrm{SHR}}(\alpha_K)$ is obtained for $\alpha = \alpha_K = 0.55$, which is the value of the shrinkage
parameter that minimizes the Euclidean distance in the stability-information plane
between the curve corresponding to $C^{\mathrm{SHR}}(\alpha)$ and the point associated with $\Sigma$. The
latter point represents the expectation value for a filtering procedure able to recover
the true correlation matrix of the system. See the text for more details.

We now consider an application to a real system. We investigate the daily
returns of N = 100 highly capitalized stocks traded at the NYSE in the
period 2001-2003 (T = 748). In the present study, unlike in Tumminello
et al. (2007a), where we worked in the Gaussian approximation, we assume
stock returns to be Student's t-distributed according to Eq. (20). The scaling
parameter $\mu$ in the distribution of Eq. (20) is assumed to be the same for all of
the stocks in the portfolio. Accordingly, we determine $\mu$ as the average value
of the maximum likelihood estimates of $\mu$ independently evaluated for each
stock. The result is $\mu = 5.9$ with a standard deviation $\sigma_\mu = 1.8$. The ratio
$\mu/N = 0.059$ is much less than one and therefore ensures that Eq. (22) is a
good approximation of the Kullback-Leibler distance for the present case. In
Fig. 9 we show the performance of different filtering procedures in the
stability-information plane. In the figure the Kullback-Leibler distance is calculated
according to Eq. (22).

The filtering procedures based on RMT, SLCA and ALCA have different
properties in terms of stability and information (Tumminello et al., 2007a).
SLCA is the most stable but the least informative, whereas RMT is
the least stable but the most informative. ALCA has intermediate properties
with respect to both stability and information. The filtering procedure
based on shrinkage seems to outperform the other filtering techniques for
selected values of the parameter $\alpha$. In Fig. 9 we also highlight (red color) a
point corresponding to the optimal value $\alpha_K = 0.55$ of the shrinkage
parameter $\alpha$ in the stability-information plane. This point is obtained by
minimizing the Euclidean distance between the points $(0, \langle K(C_i, \Sigma) \rangle)$ and
$(\langle K(C_i^{\mathrm{SHR}}(\alpha), C_j^{\mathrm{SHR}}(\alpha)) \rangle, \langle K(C_i, C_i^{\mathrm{SHR}}(\alpha)) \rangle)$ in the stability-information
plane. The point $(0, \langle K(C_i, \Sigma) \rangle)$ is expected for a filtering procedure
able to perfectly detect the correlation matrix $\Sigma$ of the system. Our
aim is now to estimate $\langle K(C_i, \Sigma) \rangle$. By using the estimate of $\mu$, we
evaluate the quantity $\langle K(C_i, \Sigma) \rangle$ by exploiting the model independence of
the Kullback-Leibler distance. During our analysis we realized that for
Student's t-distributions the bootstrap replicas, which we actually use to deal
with real data, can introduce a bias with respect to independent simulations.
By performing numerical simulations of Student's t-distributed random variables
characterized by different values of $\mu$, we notice that this bias does not
appear in the case of normal or close to normal distributions (typically when
$\mu$ is greater than 10). In order to overcome the problem when Student's
t-distributed random variables with a low value of $\mu$ describe the data better
than Gaussian random variables, we perform 100 independent simulations of
Student's t-distributed data series of length T = 748 (the same as real data)
according to the model of Eq. (25) with scaling parameter $\mu = 5.9$ and we
construct 100 bootstrap replicas of each simulated data series. Note
that the choice of Eq. (25) does not affect the generality of results because the
expectation values of the Kullback-Leibler distance do not depend on the
correlation structure of the model. We indicate the correlation matrix of a simulated
series with $C_j$ (j = 1, ..., 100), and the correlation matrix of a bootstrap replica
associated with $C_j$ with $C_{ij}$ (i = 1, ..., 100). Because the expectation value of the
correlation matrix $C_{ij}$ is $C_j$ and the expectation values of the Kullback-Leibler
distance are model independent, we can estimate the value of $\langle K(C_i, \Sigma) \rangle$
as $\langle K(C_{ij}, C_j) \rangle$, where the average is taken over both the indices i and j.
The result we obtain is $\langle K(C_{ij}, C_j) \rangle = 6.01$. This result is shown in Fig. 9
as a blue circle. Finally, in order to associate an error bar with this value, we
apply the same procedure used to obtain the estimate of $\langle K(C_i, \Sigma) \rangle$ for
values of $\mu$ equal to $\mu_{\min} = \mu - \sigma_\mu = 5.9 - 1.8 = 4.1$ (providing the value at
the top of the error bar in the figure) and $\mu_{\max} = \mu + \sigma_\mu = 5.9 + 1.8 = 7.7$
(providing the value at the bottom of the error bar in the figure). A series of
numerical simulations performed with different models generating the dynamics
of the random variables shows that the bias introduced by bootstrapping data
instead of performing independent simulations is approximately
independent of the actual correlation matrix of the system, and for $\mu = 5.9$,
N = 100, and T = 748 the bias is equal to $-10.8\% \pm 1.7\%$, i.e. the bootstrap
based estimation gives an underestimate of $\langle K(C_i, \Sigma) \rangle$. However, in the
investigations summarized in Fig. 9 the bias is the same for all of the points
and therefore the comparison of filtering procedures is possible. Our numerical
simulations also show that the value of the bias tends to increase as the value
of $\mu$ decreases.

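The simulation and bootstrap scheme used for the reference value can be sketched as follows,
under several loudly stated assumptions. The multivariate Student's t series are generated
with the standard normal over chi-square construction, which may differ in parametrization
from Eq. (20). The generating model is an identity correlation matrix used as a placeholder
for Eq. (25). The Gaussian Kullback-Leibler distance is used as a stand-in for Eq. (22), which
is not reproduced here. All function names and the small sizes are illustrative.

```python
import numpy as np

def simulate_student_t(corr, mu, n_records, rng):
    """Multivariate Student's t series: Gaussian vectors divided by sqrt(chi2_mu / mu)."""
    n = corr.shape[0]
    z = rng.multivariate_normal(np.zeros(n), corr, size=n_records)
    g = rng.chisquare(mu, size=(n_records, 1))   # one scale factor per record
    return z / np.sqrt(g / mu)

def bootstrap_correlations(x, n_boot, rng):
    """Correlation matrices of bootstrap replicas (records resampled with replacement)."""
    n_records = x.shape[0]
    replicas = []
    for _ in range(n_boot):
        rows = rng.integers(0, n_records, size=n_records)
        replicas.append(np.corrcoef(x[rows], rowvar=False))
    return replicas

def gaussian_kl(c1, c2):
    """Stand-in distance: Gaussian Kullback-Leibler distance K(c1, c2)."""
    n = c1.shape[0]
    _, logdet1 = np.linalg.slogdet(c1)
    _, logdet2 = np.linalg.slogdet(c2)
    return 0.5 * (logdet2 - logdet1 + np.trace(np.linalg.solve(c2, c1)) - n)

rng = np.random.default_rng(1)
mu, sigma_mu = 5.9, 1.8
mu_min, mu_max = mu - sigma_mu, mu + sigma_mu    # values used for the error bar

model = np.eye(5)                                # placeholder model (assumption)
values = []
for _ in range(4):                               # independent simulations, index j
    x = simulate_student_t(model, mu, 150, rng)
    c_j = np.corrcoef(x, rowvar=False)
    for c_ij in bootstrap_correlations(x, 4, rng):   # bootstrap replicas, index i
        values.append(gaussian_kl(c_ij, c_j))
estimate = float(np.mean(values))                # average over both indices i and j
```

Repeating the last loop with `mu_min` and `mu_max` in place of `mu` gives the two ends of the
error bar in this sketch. The sizes used in the text are 100 simulations, 100 bootstrap
replicas each, N = 100 and T = 748.
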
It is also worth noting that results very similar to those shown in Fig. 9 can
be obtained by using the expression of Eq. (15) (valid for Gaussian variables)
to evaluate the Kullback-Leibler distance instead of Eq. (22) (valid for
Student's t-distributed variables with $\mu/N \ll 1$). This fact can be interpreted
by observing that the correlation matrices involved in the Kullback-Leibler
distance do not differ much from one another and, therefore, Eq. (22) gives
estimates of the Kullback-Leibler distance similar to those obtained by using
Eq. (15), which is strictly speaking only valid for Gaussian variables and
which was used in Tumminello et al. (2007a). A final remark concerns
the shrinkage technique. We note that the shrinkage parameter $\alpha_K = 0.55$ is
significantly different from $\alpha^* = 0.16$, obtained by minimizing the Frobenius
norm. The point associated with $\alpha^* = 0.16$ (green circle in Fig. 9) in the
stability-information plane is quite far from the point of an ideal filtering procedure
able to perfectly detect the correlation matrix $\Sigma$ of the system (blue circle in
Fig. 9). This observation suggests that by using the Frobenius distance to get
an estimate of the shrinkage parameter $\alpha$ one puts too much faith in the
statistical robustness of the sample correlation matrix. Conversely, $\alpha_K$ represents
a more reliable estimate of the optimal shrinkage parameter.



7 Conclusions 

This paper discusses several methods to quantitatively investigate the proper- 
ties of the correlation matrix of a system of N elements. In the present work we 
consistently investigate the correlation matrix of the synchronous dynamics of 
the returns of a portfolio of financial assets. However, our results apply to any 
correlation matrix computed starting from the series of T records belonging 
to N elements of a system of interest. 

Specifically, we discuss how to associate a hierarchical tree and correlation
based trees or graphs with a correlation matrix. In previous papers we have shown
that the information selected through these clustering procedures and through the
construction of correlation based trees or graphs points out interesting
details of the investigated system. For example, hierarchical clustering is
able to detect clusters of stocks belonging to the same sectors or sub-sectors
of activity without any supervision of the clustering procedure.
We have also shown that the information present in correlation based trees
and graphs provides additional clues about the interrelations among stocks of
different economic sectors and sub-sectors. It is worth noting that this kind of
information is not present in the hierarchical trees
obtained by the ALCA and SLCA clustering procedures or, equivalently, in the
associated ultrametric correlation matrices. The information obtained from
what we call the "filtering procedure" of the correlation matrix is subject
to statistical uncertainty. For this reason, we discuss a bootstrap methodology
able to quantify the statistical robustness of both the hierarchical trees and
correlation based trees or graphs.

The hierarchical trees and correlation based trees and graphs associated with 
portfolios of stocks traded in financial markets often show clusters of stocks 
partitioned in sub-clusters, sub-clusters partitioned in sub-sub-clusters and 
so on until the level of the single stock. The ubiquity of this observation 
has motivated us to develop a hierarchically nested factor model able to fully 
describe this property. Our model is a nested factor model characterized by the 
same correlation matrix as the empirical set of data. The model is expressed 
in a direct and simple form when all the correlation coefficients are positive 
or very close to positive (for a precise definition of the limits of validity of this 
extension see Section 3). The number of factors of the model is by construction
equal to the number of elements of the system. Again the selection of the most 
statistically reliable factors detected in a real system is obtained by a procedure 
based on bootstrap with a bootstrap threshold selected in a self-consistent way. 

The amount of information and the statistical stability of filtering procedures 
of the correlation matrix are quantified by using the Kullback-Leibler distance. 
We report and discuss analytical results both for Gaussian and for Student's 
t-distributed multivariate time series. In both cases the expectation values of 
the Kullback-Leibler distance are model independent, indicating that this dis- 
tance is a good estimator of the statistical uncertainty due to the finite size of 
the empirical sample. These properties are not observed in other widely used
distances between matrices such as, for example, the Frobenius distance,
which is a standard measure of the distance between matrices. In our example
with real data, we estimate the amount of information retained and the sta- 
bility of the filtering procedure used in a data set of 100 stocks approximately 
described by a multivariate Student's t distribution. For this set of data we are 
able to discriminate among filtering procedures as different as ALCA, SLCA, 
random matrix theory and a shrinkage procedure. 
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Table 1 

1A- Stocks with tick symbol from ABT to DUK. The first column is the tick symbol
in alphabetical order, the second column reports an abbreviation of the economic
sector of the considered company. Specifically, we have Basic Materials (BM),
Consumer Cyclical (CC), Consumer Non Cyclical (CNC), Energy (E), Services (S),
Financial (F), Healthcare (H), Technology (T), Utilities (U), Transportation (TR),
Conglomerates (CO) and Capital Goods (CG). The third column indicates the
economic sub-sector of the company. The fourth column reports the company name
whereas the fifth column is the numerical label of the stock used in Figs. 6 and 7.

tick  sector  sub-sector                     name                            ord
ABT   H       Major Drugs                    Abbott Laboratories             72
ADI   T       Semiconductors                 Analog Devices Inc              48
AFL   F       Insurance Accidental & Health  Aflac Inc                       62
AIG   F       Insurance Prop. & Casualty     American Intl Group Inc         16
ALL   F       Insurance Prop. & Casualty     Allstate Corp The               60
AVP   CNC     Personal & Household Products  Avon Products Inc               82
AXP   F       Consumer Financial Services    American Express Company        6
BA    CG      Aerospace & Defense            Boeing Co                       36
BAC   F       Money Center Banks             Bank Of America Corp            12
BAX   H       Medical Equipment & Supplies   Baxter International Inc        78
BBY   S       Retail Technology              Best Buy Co Inc                 43
BK    F       Money Center Banks             Bank Of New York Inc            14
BLS   S       Communication Services         Bellsouth Corporation           67
BMY   H       Major Drugs                    Bristol Myers Squibb Company    74
BNI   TR      Railroad                       Burlington Nrthrn Santa Fe Com  38
BSC   F       Investment Services            Bear Stearns Companies Inc      2
BSX   H       Medical Equipment & Supplies   Boston Scientific Corp          97
BUD   CNC     Beverages Alcoholic            Anheuser Busch Cos Inc          87
CA    T       Software & Programming         Computer Associates Intl Inc    52
CAG   CNC     Food Processing                Conagra Foods Inc.              91
CAH   H       Biotechnology & Drugs          Cardinal Health Inc             75
CAT   CG      Constr. & Agric. Machinery     Caterpillar Inc                 29
CCU   S       Broadcasting & Cable TV        Clear Channel Communictns Inc   22
CI    F       Insurance Accidental & Health  Cigna Corp                      63
CL    CNC     Personal & Household Products  Colgate-palmolive Co            80
DD    BM      Chemical - Plastic & Rubber    Du Pont De Nemours E I Co       25
DE    CG      Constr. & Agric. Machinery     Deere Co                        30
DHR   T       Scientific & Technical Instr.  Danaher Corp                    32
DIS   S       Broadcasting & Cable TV        Walt Disney Co-disney Common    21
DOW   BM      Chemical - Plastic & Rubber    Dow Chemical Co                 27
DUK   U       Electric Utilities             Duke Energy Corporation         93



Table 2 



1B- Stocks with tick symbol from EMC to LOW. The content of columns is the
same as in Table 1A.

tick  sector  sub-sector                     name                            ord
EMC   T       Computer Storage Devices       Emc Corporation                 ?
EMR   CO      Conglomerates                  Emerson Electric Co             66
FDC   T       Computer Services              First Data Corp                 40
FNM   F       Consumer Financial Services    Fannie Mae                      ?
FON   S       Communication Services         Sprint Corp Fon Group           ?
FRE   F       Consumer Financial Services    Freddie Mac D/b/a Voting        59
G     CNC     Personal & Household Products  Gillette Co                     ?
GCI   S       Printing & Publishing          Gannett Co Inc                  ?
GD    CG      Aerospace & Defense            General Dynamics Corp           ?
GDT   H       Medical Equipment & Supplies   Guidant Corp                    ?
GDW   F       S&Ls/Savings Banks             Golden West Financial Corp      ?
GE    CO      Conglomerates                  General Electric                ?
GIS   CNC     Food Processing                General Mills Inc               ?
GM    CG      Auto & Truck Manufacturers     General Motors Corp             34
GPS   S       Retail Apparel                 Gap Inc The                     45
HD    S       Retail Home Improvement        Home Depot Inc                  39
HDI   ?       Recreational Products          Harley Davidson Inc             24
IBM   T       Computer Hardware              Intl Business Machines Corp     50
IGT   S       Casinos & Gaming               Intl Game Technology            56
IP    BM      Paper & Paper Products         International Paper Co          28
ITW   CG      Misc. Capital Goods            Illinois Tool Works             31
JNJ   H       Major Drugs                    Johnson And Johnson             71
K     CNC     Food Processing                Kellogg Co                      89
KMB   BM      Paper & Paper Products         Kimberly Clark Corp             81
KO    CNC     Beverages Non-Alcoholic        Coca-cola Co                    84
KR    S       Retail Grocery                 Kroger Co                       94
KRB   F       Regional Banks                 MBNA Corp                       7
KSS   S       Retail Department & Discount   Kohls Corp                      42
LEH   F       Investment Services            Lehman Brothers Holdings        1
LLY   H       Major Drugs                    Lilly Eli Co                    73
LOW   S       Retail Home Improvement        Lowes Companies Inc             40

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Table 3 

1C- Stocks with tick symbol from MCD to WMT. The content of columns is the
same as in Table 1A.

tick  sector  sub-sector                     name                            ord
MCD   S       Restaurants                    Mcdonalds Corp                  99
MDT   H       Medical Equipment & Supplies   Medtronic Inc                   77
MEL   F       Investment Services            Mellon Financial Corp           15
MER   F       Investment Services            Merrill Lynch Co Inc            3
MMC   F       Insurance Miscellaneous        Marsh Mclennan Cos Inc          18
MOT   T       Communication Equipment        Motorola Inc                    51
MRK   H       Major Drugs                    Merck Co Inc                    70
NCC   F       Regional Banks                 National City Corp              8
NEM   BM      Gold & Silver                  Newmont Mining Corp Holding C   100
NOC   CG      Aerospace & Defense            Northrop Grumman Cp Hldg Co     96
OMC   S       Advertising                    Omnicom Group Inc               23
ONE   F       Regional Banks                 Bank One Corp                   10
OXY   E       Oil & Gas Operations           Occidental Petroleum Corp       54
PEP   CNC     Beverages Non-Alcoholic        Pepsico Inc                     85
PFE   H       Major Drugs                    Pfizer Inc                      69
PG    CNC     Personal & Household Products  Procter Gamble Co               79
PGR   F       Insurance Prop. & Casualty     Progressive Corp                17
PNC   F       Regional Banks                 Pnc Finl Svcs Grp Inc The       11
PPG   BM      Chemical Manufacturing         Ppg Industries Inc              26
RD    E       Oil & Gas - Integrated         Royal Dutch Pet New 1.25gldrs   55
S     S       Retail Department & Discount   Sears Roebuck Co                44
SBC   S       Communication Services         Sbc Communications Inc          66
SCH   F       Investment Services            Schwab Charles Corp             4
SGP   H       Major Drugs                    Schering Plough Corp            76
SLB   E       Oil Well Services & Equipment  Schlumberger Ltd                53
SLE   CNC     Food Processing                Sara Lee Corp                   88
SO    U       Electric Utilities             Southern Co                     92
STI   F       Regional Banks                 Suntrust Banks Inc              9
SYY   S       Retail Grocery                 Sysco Corp                      86
TRB   S       Printing & Publishing          Tribune Company                 20
TXN   T       Semiconductors                 Texas Instruments               47
TYC   CO      Conglomerates                  Tyco International Ltd New      65
UNP   TR      Railroad                       Union Pacific Corporation       37
UTX   CO      Conglomerates                  United Technologies Corp        35
WAG   S       Retail Drugs                   Walgreen Company                57
WFC   F       Money Center Banks             Wells Fargo Co New              13
WLP   F       Insurance Accidental & Health  Wellpoint Hlth Netwks Hldg Co   64
WMT   S       Retail Department & Discount   Wal-mart Stores Inc             41



